Psychometric Testing

Module 1: Introduction. History of Psychological Testing. Meaning, Definition and Types of Psychological Testing. Ethical issues in Psychological Testing
In probability theory and statistics, variance is the expectation of the squared deviation of a
random variable from its mean. Informally, it measures how far a set of (random) numbers
are spread out from their average value.
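As a minimal sketch of this definition, the variance of a small set of scores can be computed as the mean of the squared deviations from the mean. The scores below are made-up illustrative values, not data from these notes, and the population form (dividing by n) is assumed.

```python
from statistics import mean, pvariance

# Hypothetical test scores, used only for illustration.
scores = [4, 8, 6, 5, 7]

mu = mean(scores)  # the average value of the scores

# Variance: the mean of the squared deviations from the mean
# (population form, dividing by the number of scores).
variance = sum((x - mu) ** 2 for x in scores) / len(scores)

# Cross-check against the standard library's population variance.
library_variance = pvariance(scores)

print(mu, variance, library_variance)
```

The square root of this value is the standard deviation, which is expressed in the same units as the scores themselves.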
Ethical issues in Psychological Testing

One of the major ethical issues associated with psychological testing is privacy. Any
psychological test is carried out with the implicit understanding that its findings
will not be divulged to third parties (McIntire & Miller, 2007). A case in point is
the use of employees’ psychological test results by employers. Teachers may also use
students’ psychological test results without the students’ consent.
Another ethical issue relating to psychological testing is the purpose for which the findings
of the psychological tests are used (Beck et al., 1996). Ethical principles require that the
purpose of the test be made known to the client. In practice this often does not happen, and
results are frequently used for purposes other than those specified. This is unethical, because
the reasons for which an individual agreed to take a psychological test may differ from those
for which the findings are actually used.
Employers use psychological tests to determine who is most qualified for a particular job
from a pool of many. Questions have been raised on the legality of the use of the tests in
such a process (Plante, 2005). Some even view this as a form of discrimination in the
workplace. Some employers also use psychological tests to determine successful candidates
for promotion and many view this as a wrong choice of method to use in making such
decisions.
Psychological tests are used in criminal justice. The popular lie detector, also known as a
polygraph, is an example of psychological testing where authorities try to detect lies from
suspected criminals. Questions are raised about the accuracy of the tests and there is
always the fear that the tests may give wrong results on an innocent individual.
Criterion Reference test vs Normative Reference test

Norm-referenced refers to standardized tests that are designed to compare and rank test
takers in relation to one another. Norm-referenced tests report whether test takers
performed better or worse than a hypothetical average student, which is determined by
comparing scores against the performance results of a statistically selected group of test
takers, typically of the same age or grade level, who have already taken the exam. Norm-
referenced tests often use a multiple-choice format, though some include open-ended,
short-answer questions. They are usually based on some form of national standards, not
locally determined standards or curricula. IQ tests are among the most well-known norm-
referenced tests, as are developmental-screening tests, which are used to identify learning
disabilities in young children or determine eligibility for special-education services. A few
major norm-referenced tests include the California Achievement Test, Iowa Test of Basic
Skills, Stanford Achievement Test, and TerraNova.
Criterion-referenced test results are often based on the number of correct answers provided
by students, and scores might be expressed as a percentage of the total possible number of
correct answers. On a norm-referenced exam, however, the score would reflect how many
more or fewer correct answers a student gave in comparison to other students. Norm-
referenced tests (or NRTs) compare an examinee’s performance to that of other examinees.
Standardized examinations such as the SAT are norm-referenced tests. The goal is to rank
the set of examinees so that decisions about their opportunity for success (e.g. college
entrance) can be made. Criterion-referenced tests (or CRTs) differ in that each examinee’s
performance is compared to a pre-defined set of criteria or a standard. The goal with these
tests is to determine whether or not the candidate has demonstrated mastery of a
certain skill or set of skills. These results are usually “pass” or “fail” and are used in making
decisions about job entry, certification, or licensure.
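The contrast between the two kinds of score can be sketched in a few lines. This is only an illustration: the norm group, the 50-item test length, and the 80% cut score are hypothetical, and the percentile rank here uses one common convention (the percentage of the norm group scoring strictly below the examinee).

```python
def percent_correct(raw_score, total_items):
    """Criterion-referenced view: percentage of items answered correctly."""
    return 100.0 * raw_score / total_items


def percentile_rank(raw_score, norm_group_scores):
    """Norm-referenced view: percentage of the norm group scoring below raw_score."""
    below = sum(1 for s in norm_group_scores if s < raw_score)
    return 100.0 * below / len(norm_group_scores)


# Hypothetical norm group of raw scores on a 50-item test.
norm_group = [22, 25, 28, 30, 31, 33, 35, 36, 40, 44]
score = 35
cut_score = 80.0  # assumed mastery criterion, in percent correct

pct = percent_correct(score, 50)            # compared with a fixed standard
rank = percentile_rank(score, norm_group)   # compared with other examinees
passed = pct >= cut_score                   # criterion-referenced pass/fail decision

print(pct, rank, passed)
```

The same raw score can therefore look strong in norm-referenced terms (above most of the group) and still fail a criterion-referenced standard.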
28th august
Brackets for intelligence
Wechsler (WAIS–III) 1997 IQ test classification
IQ Range ("deviation IQ")    IQ Classification
130 and above                Very superior
120–129                      Superior
110–119                      High average
90–109                       Average
80–89                        Low average
70–79                        Borderline
69 and below                 Extremely low
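The brackets above amount to a simple lookup. A minimal sketch follows; the function name is hypothetical and whole-number IQ scores are assumed.

```python
def classify_wais3(iq):
    """Return the WAIS-III classification label for a deviation IQ score."""
    # Lower bound of each bracket, checked from highest to lowest.
    brackets = [
        (130, "Very superior"),
        (120, "Superior"),
        (110, "High average"),
        (90, "Average"),
        (80, "Low average"),
        (70, "Borderline"),
    ]
    for lower_bound, label in brackets:
        if iq >= lower_bound:
            return label
    return "Extremely low"  # 69 and below


for example_iq in (135, 109, 85, 69):
    print(example_iq, classify_wais3(example_iq))
```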
There are many advantages of standardized testing:

1. Standardized tests are practical: they are easy to administer and take less
time to administer than other assessments.
2. Standardized testing results are quantifiable. By quantifying students' achievements,
educators can identify proficiency levels and more easily identify students in need of
remediation or advancement.
3. Standardized tests are scored via computer, which frees up time for the educator.
4. Since scoring is completed by computer, it is objective and not subject to educator
bias or emotions.
5. Standardized testing allows educators to compare scores to students within the
same school and across schools. This information provides data on not only the
individual student's abilities but also on the school as a whole. Areas of school-wide
weaknesses and strengths are more easily identifiable.
There are disadvantages of standardized testing.
1. Standardized test items are not parallel with typical classroom skills and behaviors.
Because questions must be generalizable to the entire population, most items assess
general knowledge and understanding.
2. Since general knowledge is assessed, educators cannot use standardized test results
to inform their individual instruction methods. If recommendations are made,
educators may begin to 'teach to the test' as opposed to teaching what is currently
in the curriculum or based on the needs of their individual classroom.
3. Standardized test items do not assess higher-level thinking skills.
4. These tests are influenced by nonacademic factors such as fatigue and attention.
3rd Neuropsychological test

The Clinical Dementia Rating or CDR is a numeric scale used to quantify the severity of
symptoms of dementia. Using a structured-interview protocol developed by Charles Hughes,
Leonard Berg, John C. Morris and colleagues, it assesses a patient's cognitive and functional
performance in six areas: memory, orientation, judgment & problem solving, community
affairs, home & hobbies, and personal care. Scores in each of these are combined to obtain
a composite score ranging from 0 through 3.
Qualitative equivalences are as follows:
Composite Rating    Symptoms
0                   none
0.5                 very mild
1                   mild
2                   moderate
3                   severe
CDR is credited with being able to discern very mild impairments, but its weaknesses include
the amount of time it takes to administer, its ultimate reliance on subjective assessment,
and relative inability to capture changes over time.
Different stressors faced during transition periods

Life changes are inevitable. Whether it’s a job change, the beginning or ending of a
relationship, starting a family, or a loss, transitions are part of the human experience – yet
can often be difficult to adapt to. In order to cope with these changes, many of us find
ourselves in a “fight, flight or freeze response.” For example, a difficult transition may cause
us to get angry, to compartmentalize our feelings or avoid them altogether. We may feel
like we’re unable to move forward – frozen with worry and fear.
Pluripotentiality refers to the multiple functional roles of the brain. That is, any given area of
the brain can be involved in relatively few or relatively many behaviors.
11/9 hw
Cross validation and publishing a test (pg 128 to 130 in psychological testing textbook)
Cross validation refers to the practice of using the original regression equation in a new sample
to determine whether the test predicts the criterion as well as it did in the original sample.
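The idea can be sketched as follows: fit a regression line in the original sample, then apply that same equation, unchanged, to a new sample and compare the prediction error. All data below are hypothetical, and a single-predictor least-squares line is assumed for simplicity.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(predicted, observed):
    """Average squared difference between predicted and observed values."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)


# Hypothetical original sample: test scores and criterion values.
original_test = [10, 12, 15, 18, 20, 24]
original_criterion = [2.1, 2.4, 2.9, 3.2, 3.5, 3.9]

# Hypothetical new (cross-validation) sample.
new_test = [11, 14, 16, 19, 22]
new_criterion = [2.0, 2.9, 2.8, 3.6, 3.4]

# The equation is estimated once, on the original sample only.
slope, intercept = fit_line(original_test, original_criterion)


def predict(test_scores):
    """Apply the original regression equation to any set of test scores."""
    return [intercept + slope * t for t in test_scores]


mse_original = mean_squared_error(predict(original_test), original_criterion)
mse_new = mean_squared_error(predict(new_test), new_criterion)

print(f"original-sample MSE: {mse_original:.4f}")
print(f"new-sample MSE:      {mse_new:.4f}")
```

A clearly larger error in the new sample suggests that the equation fitted the original sample better than it generalizes, which is the kind of shrinkage cross validation is meant to reveal.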
Binet Analysis test
The Binet-Simon Scale was developed by Alfred Binet and his student Theodore Simon. The
Stanford-Binet test is meant to gauge and analyze intelligence through five factors of
cognitive ability. These five factors include fluid reasoning, knowledge, quantitative
reasoning, visual-spatial processing and working memory. Both verbal and nonverbal
responses are measured. Each of the five factors is given a weight and the combined score is
often reduced to a ratio known commonly as the intelligence quotient, or IQ. The Stanford-
Binet test is the reason we have the IQ scale we are most familiar with today, and the one
most high-IQ societies base their admission threshold on. The test is among the most
reliable standardized tests currently used in education. It has undergone many validity tests
and revisions throughout its century-long history.
Module 2: Measurement. Nature and significance of Measurement. Distinction between assessment and measurement. Levels of measurement. Techniques of Attitude Measurement
Measurement is the assignment of scores to individuals so that the scores represent some
characteristic of the individuals. Psychological measurement is often referred to as
psychometrics. Imagine a clinical psychologist who is interested in how depressed a person
is. He administers the Beck Depression Inventory, which is a 21-item self-report
questionnaire in which the person rates the extent to which he or she has felt sad, lost
energy, and experienced other symptoms of depression over the past 2 weeks. The sum of
these 21 ratings is the score and represents his or her current level of depression. The
important point here is that measurement does not require any particular instruments or
procedures. It requires some systematic procedure for assigning scores to individuals or
objects so that those scores represent the characteristic of interest. Many variables studied
by psychologists are straightforward and simple to measure. These include sex, age, height,
weight, and birth order. Other variables studied by psychologists—perhaps the majority—
are not so straightforward or simple to measure. We cannot accurately assess people’s level
of intelligence by looking at them, and we certainly cannot put their self-esteem on a
bathroom scale. These kinds of variables are called constructs (pronounced CON-structs)
and include personality traits (e.g., extroversion), emotional states (e.g., fear), attitudes
(e.g., toward taxes), and abilities (e.g., athleticism). Psychological constructs cannot be
observed directly. One reason is that they often represent tendencies to think, feel, or act in
certain ways. For example, to say that a particular college student is highly extroverted (see
Note 5.6 “The Big Five”) does not necessarily mean that she is behaving in an extroverted
way right now.
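The summed-rating procedure described above for the 21-item questionnaire is an example of a systematic scoring rule. A minimal sketch, assuming each item is rated from 0 to 3 (the rating range is not stated in these notes and is an assumption here; the function name is hypothetical):

```python
def total_score(item_ratings, n_items=21, min_rating=0, max_rating=3):
    """Sum the item ratings after checking that the response set is complete and valid."""
    if len(item_ratings) != n_items:
        raise ValueError(f"expected {n_items} ratings, got {len(item_ratings)}")
    for rating in item_ratings:
        if not min_rating <= rating <= max_rating:
            raise ValueError(f"rating {rating} is outside {min_rating}-{max_rating}")
    return sum(item_ratings)


# Hypothetical responses from one person: 21 ratings.
responses = [1, 0, 2, 1, 0, 0, 1, 3, 0, 1, 2, 0, 0, 1, 1, 0, 2, 0, 1, 0, 1]
score = total_score(responses)
print(score)
```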
Distinction between Assessment and Measurement

Assessment can be defined as the process of gathering data and fashioning them into an
interpretable form for decision-making. It involves collecting data with a view to making a
value judgment about the quality of a person, object, group or event. Assessment in science
education can be defined as the use of various measurement techniques to determine the
extent to which learners have benefited from the programme to which they have been exposed.
Educational assessment provides the feedback required to maximize the
outcome of educational effort.
Measurement refers to the set of procedures, and the principles for how to use those
procedures, in educational tests and assessments. Some of the basic concepts of
measurement in educational evaluation are raw scores, percentile ranks, derived
scores, standard scores, etc.
Levels of Measurement (pg 114)

The psychologist S. S. Stevens suggested that scores can be assigned to individuals so that
they communicate more or less quantitative information about the variable of interest.
Stevens actually suggested four different levels of measurement (which he called “scales of
measurement”) that correspond to four different levels of quantitative information that can
be communicated by a set of scores.
1. The nominal level of measurement is used for categorical variables and involves
assigning scores that are category labels. Category labels communicate whether any
two individuals are the same or different in terms of the variable being measured.
For example, if you look at your research participants as they enter the room, decide
whether each one is male or female, and type this information into a spreadsheet,
you are engaged in nominal-level measurement.
2. The ordinal level of measurement involves assigning scores so that they represent
the rank order of the individuals. Ranks communicate not only whether any two
individuals are the same or different in terms of the variable being measured but
also whether one individual is higher or lower on that variable.
3. The interval level of measurement involves assigning scores so that they represent
the precise magnitude of the difference between individuals, but a score of zero
does not actually represent the complete absence of the characteristic. A classic
example is the measurement of heat using the Celsius or Fahrenheit scale. The
difference between temperatures of 20°C and 25°C is precisely 5°, but a temperature
of 0°C does not mean that there is a complete absence of heat. In psychology, the
intelligence quotient (IQ) is often considered to be measured at the interval level.
4. The ratio level of measurement involves assigning scores in such a way that there is a
true zero point that represents the complete absence of the quantity. Height
measured in meters and weight measured in kilograms are good examples. So are
counts of discrete objects or events such as the number of siblings one has or the
number of questions a student answers correctly on an exam.
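The difference between the interval and ratio levels can be checked numerically. In the sketch below, the Kelvin conversion is brought in as an added illustration (it is not part of these notes): differences survive a change of scale, but ratios are only meaningful when zero is a true zero.

```python
def celsius_to_kelvin(celsius):
    """Convert Celsius (interval scale) to Kelvin (ratio scale with a true zero)."""
    return celsius + 273.15


warm_c, cool_c = 20.0, 10.0

# Differences are meaningful at the interval level: the gap is the same on both scales.
diff_c = warm_c - cool_c
diff_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)

# Ratios are not: 20 degrees C is not "twice as hot" as 10 degrees C.
ratio_c = warm_c / cool_c
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

print(diff_c, round(diff_k, 6))
print(ratio_c, round(ratio_k, 4))
```

The same reasoning is why an IQ of 120 is not described as "twice as intelligent" as an IQ of 60 when IQ is treated as an interval-level measure.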
Techniques of Attitude Measurement

Attitude measurement can be divided into two basic categories:
• Direct Measurement (Likert scale and semantic differential), given below
• Indirect Measurement (projective techniques)
Indirect methods typically involve the use of a projective test. A projective test involves
presenting a person with an ambiguous (i.e. unclear) or incomplete stimulus (e.g. a picture or
words). The stimulus requires interpretation from the person. Therefore, the person’s
attitude is inferred from their interpretation of the ambiguous or incomplete stimulus. The
assumption behind these measures of attitudes is that the person will “project” his or her
views, opinions or attitudes onto the ambiguous situation, thus revealing the attitudes the
person holds. However, indirect methods only provide general information and do not offer
a precise measurement of attitude strength, since they are qualitative rather than quantitative.
A major criticism of this method of attitude measurement is that it is not objective or scientific.
Examples are the Rorschach Inkblot Test and the Thematic Apperception Test (TAT).
13/9 -Types of Scales
1. Dichotomous Scales- The dichotomous question is a question which can have two
possible answers. Dichotomous questions are usually used in a survey that asks for a
Yes/No, True/False or Agree/Disagree answers. They are used for clear distinction of
qualities, experiences or respondents’ opinions. Dichotomous questions (Yes/No) may seem
simple, but they have a few problems, both on the part of the survey respondent and in terms
of analysis. Yes/No questions often force respondents to choose between options that may
not be that simple, and may lead to a subject deciding on an option that doesn’t truly
capture their feelings.
Eg- Myers-Briggs Type Indicator (MBTI). MBTI reports tell you your preference for each of
four pairs: Extraversion or Introversion (E or I), Sensing or Intuition (S or N), Thinking or Feeling
(T or F), and Judging or Perceiving (J or P). Binary variables are a sub-type of dichotomous
variable; variables assigned either a 0
or a 1 are said to be in a binary state. For example, male (0) and female (1). Dichotomous
variables can be further described as either a discrete dichotomous variable or a continuous
dichotomous variable. The idea is very similar to regular discrete variables and continuous
variables. When a dichotomous variable is discrete, there is nothing in between its two values;
when it is continuous, there are possibilities in between. “Dead or Alive” is a
discrete dichotomous variable: you can only be dead or alive. “Passing or
Failing an Exam” is a continuous dichotomous variable. Grades on a test can range from 0 to
100% with every possible percentage in between. You could get 74% and pass. You could
get 69% and fail. Or a 69.5% and pass (if your professor rounds up!).
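The pass/fail example can be written as a small function that turns a continuous percentage into a binary (0/1) variable. A pass mark of 70% is assumed here, since it is implied by the examples above rather than stated; the function names are hypothetical.

```python
import math

PASS_MARK = 70  # assumed pass mark, implied by the 74% and 69% examples


def round_half_up(percentage):
    """Round to the nearest whole percent, with .5 always rounding up."""
    return math.floor(percentage + 0.5)


def passes(percentage, round_up=False):
    """Dichotomize a continuous grade: 1 = pass, 0 = fail."""
    mark = round_half_up(percentage) if round_up else percentage
    return 1 if mark >= PASS_MARK else 0


print(passes(74))                   # pass
print(passes(69))                   # fail
print(passes(69.5))                 # fail without rounding
print(passes(69.5, round_up=True))  # pass if the professor rounds up
```

The underlying grade is continuous; only the cut point makes it dichotomous, which is why information is lost when the binary variable is analysed instead of the original score.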
2. Ipsative scale- Ipsative is a descriptor used in psychology to indicate a specific type of measure in
which respondents compare two or more desirable options and pick the one that is most
preferred (sometimes called a "forced choice" scale). An ipsative measurement presents
respondents with options of equal desirability; thus, the responses are less likely to be
confounded by social desirability. Respondents are forced to choose one option that is
“most true” of them and choose another one that is “least true” of them. A major
underlying assumption is that when respondents are forced to choose among four equally
desirable options, the one option that is most true of them will tend to be perceived as
more positive. Similarly, when forced to choose one that is least true of them, those to
whom one of the options is less applicable will tend to perceive it as less positive. For
example, ipsative forms give the applicant a choice of 2-4 equally
positive statements, and they must give their preference or agreement to one of them, for
instance choosing between “I enjoy social events” and “I like to keep organised”. This
forces the person to think more about their answer, and hopefully to answer more truthfully, as
there is not one obviously desirable quality to pick.
Because choosing one option rules out the others, ipsative scores are not independent of one
another. This measurement dependency violates one of the basic assumptions of classical test
theory, independence of error variance, which has implications for the statistical analysis of
ipsative scores, as well as for their interpretation.
3. Q sort/rank order- Q-sort scaling is a rank-order scaling technique in which
respondents are asked to sort the presented objects into piles based on similarity according
to a specified criterion such as preference, attitude, perception, etc. In other words,
respondents sort a number of statements or attitudes into
piles, usually 11, on the basis of some specified criterion. For example, suppose the
respondents are given 100 motivational statements on individual cards and are asked to
place these in 11 piles, ranging from the “most agreed with” to the “least agreed with”.
Generally, the most agreed statement is placed on the top while the least agreed statement
is placed at the bottom.
4) Semantic Differential Scale- Semantic differentials can be used to measure opinions,
attitudes and values on a psychometrically controlled scale. The semantic differential
technique of Osgood et al. (1957) asks a person to rate an issue or topic on a standard set of
bipolar adjectives (i.e. with opposite meanings), each representing a seven-point scale. To
prepare a semantic differential scale, you must first think of a number of words with
opposite meanings that are applicable to describing the subject of the test.
For example, participants are given a word such as 'car' and presented with a variety
of adjectives to describe it. Respondents tick to indicate how they feel about what is being
measured. For instance, you could measure a person’s attitude to the word “Work” with the
following scale:
Boring------------------ Interesting
Unnecessary----------Necessary
The semantic differential technique reveals information on three basic dimensions of
attitudes: evaluation, potency (i.e. strength) and activity.
• Evaluation is concerned with whether a person thinks positively or negatively about the
attitude topic (e.g. dirty – clean, and ugly - beautiful).
• Potency is concerned with how powerful the topic is for the person (e.g. cruel – kind, and
strong - weak).
• Activity is concerned with whether the topic is seen as active or passive (e.g. active –
passive).
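One way to summarize semantic differential responses is to average the seven-point ratings within each dimension. The sketch below is an assumption-laden illustration: the ratings are hypothetical, each scale is coded 1 to 7 from the first adjective to the second, and the assignment of adjective pairs to dimensions is chosen for the example.

```python
from collections import defaultdict

# Assumed mapping of bipolar adjective pairs to the three dimensions.
dimension_of = {
    "boring-interesting": "evaluation",
    "unnecessary-necessary": "evaluation",
    "weak-strong": "potency",
    "passive-active": "activity",
}

# Hypothetical ratings of the word "Work" by one respondent (1 = left pole, 7 = right pole).
ratings = {
    "boring-interesting": 6,
    "unnecessary-necessary": 7,
    "weak-strong": 4,
    "passive-active": 3,
}


def dimension_means(ratings, dimension_of):
    """Average the 1-7 ratings within each dimension."""
    grouped = defaultdict(list)
    for pair, rating in ratings.items():
        if not 1 <= rating <= 7:
            raise ValueError(f"rating for {pair} must be between 1 and 7")
        grouped[dimension_of[pair]].append(rating)
    return {dim: sum(values) / len(values) for dim, values in grouped.items()}


profile = dimension_means(ratings, dimension_of)
print(profile)
```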
Using this information we can see if a person’s feeling (evaluation) towards an object is
consistent with their behavior. For example, a person might like the taste of chocolate
(evaluative) but not eat it often (activity). The evaluation dimension has been most used by
social psychologists as a measure of a person’s attitude, because this dimension reflects the
affective aspect of an attitude. Osgood's Semantic Differential was an application of his
more general attempt to measure the semantics or meaning of words, particularly
adjectives, and their referent concepts. The respondent is asked to choose where his or her
position lies, on a scale between two polar adjectives (for example: "Adequate-Inadequate",
"Good-Evil" or "Valuable-Worthless").
5) Likert Scales- pg 118
6) Thurstone's method of equal appearing intervals- pg 117
7) Thurstone's method of absolute scaling- pg 118
8) Guttman Scales- pg 118
9) The visual analogue scale- The visual analogue scale, or visual analog scale (VAS), is a
psychometric response scale which can be used in questionnaires. It is a measurement
instrument that tries to measure a
characteristic or attitude that is believed to range across a continuum of values and cannot
easily be directly measured. When responding to a VAS item, respondents specify their level
of agreement to a statement by indicating a position along a continuous line between two
end-points. It is often used in epidemiologic and clinical research to measure the intensity or
frequency of various symptoms. For example, the amount of pain that a patient feels ranges
across a continuum from none to an extreme amount of pain.
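Scoring a VAS item usually comes down to measuring where the mark falls along the line. A minimal sketch, assuming a 100 mm line and a 0 to 100 score (both are assumptions for illustration; the notes do not specify a line length or scoring range):

```python
def vas_score(mark_mm, line_length_mm=100.0):
    """Convert the position of a mark on the line into a 0-100 score."""
    if not 0 <= mark_mm <= line_length_mm:
        raise ValueError("the mark must lie on the line")
    return 100.0 * mark_mm / line_length_mm


# Hypothetical pain item: 0 = no pain, 100 = an extreme amount of pain.
print(vas_score(62))        # mark placed 62 mm along a 100 mm line
print(vas_score(50, 200))   # the same rule applied to a 200 mm line
```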
Approaches to measure personality concepts
Test construction strategies are the various ways that items in a psychological measure are
created and decided upon. They are most often associated with personality tests, but can
also be applied to other psychological constructs such as mood or psychopathology. There
are three commonly used general strategies: Inductive, Deductive, and Empirical.
Inductive (Factor analysis) -The inductive method begins by constructing a wide variety of
items with little or no relation to an established theory or previous measure. The group of
items is then answered by a large number of participants and analysed using various
statistical methods, such as exploratory factor analysis or principal component analysis.
These methods allow researchers to analyse natural relationships among the questions and
then label components of the scale based on how the questions group together. The Five
Factor Model of personality was developed using this method. Advantages of this method
include the opportunity to discover previously unidentified or unexpected relationships
between items or constructs. Factor analysis is a statistical method used to describe
variability among observed, correlated variables in terms of a potentially lower number of
unobserved variables called factors. Factor analysis searches for such joint variations in
response to unobserved latent variables, that is, variables that are not directly observed but are
rather inferred (through a mathematical model) from other variables that are observed
(directly measured).
Logical Rational Deductive Approach- Also known as rational, intuitive, or deductive
method. The deductive method begins by developing a theory for the construct of interest.
This may include the use of a previously established theory. After this, items are created
that are believed to measure each facet of the construct of interest. After item creation,
initial items are selected or eliminated based upon which will result in the strongest internal
validity for each scale. Advantages of this method include clearly defined and face valid
questions for each measure. Measures are also more likely to apply across populations.
Items are related based on some theoretical framework. They simply explain or describe the
construct that has to be measured. For example, a self-esteem item would be “I am confident”.
These items correlate well with one another as well as the total score. Homogeneity of the
test is assured.
Empirical criterion approach- Also known as External or Criterion Group method as well as
method of empirical keying. Empirical test construction attempts to create a measure that
differentiates between different established groups. For example, this may include
depressed and non-depressed individuals, or individuals high or low in levels of aggression.
Pg no 119.
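Empirical keying can be sketched as keeping only the items that the two established groups answer differently. Everything below is hypothetical: the yes/no responses, the group labels, and the 0.3 minimum difference in endorsement rate are illustrative choices, not a published procedure.

```python
def endorsement_rate(responses, item):
    """Proportion of respondents in a group who endorsed (answered 1 to) an item."""
    return sum(person[item] for person in responses) / len(responses)


def empirically_keyed_items(criterion_group, comparison_group, min_difference=0.3):
    """Return the indices of items whose endorsement rates differ between the groups."""
    n_items = len(criterion_group[0])
    keyed = []
    for item in range(n_items):
        difference = (endorsement_rate(criterion_group, item)
                      - endorsement_rate(comparison_group, item))
        if abs(difference) >= min_difference:
            keyed.append(item)
    return keyed


# Hypothetical yes/no (1/0) answers to four items; each row is one person.
depressed = [[1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 1, 1], [1, 1, 0, 0]]
non_depressed = [[0, 1, 0, 1], [0, 0, 1, 1], [1, 1, 0, 1], [0, 0, 0, 1]]

keyed_items = empirically_keyed_items(depressed, non_depressed)
print(keyed_items)
```

Note that the item content plays no role in the selection: an item is kept purely because it separates the groups, which is what distinguishes this approach from the deductive one.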
Module 3: Construction of Test. Steps of constructing a Psychological Test. Reliability: Meaning, types and factors affecting reliability. Validity: Meaning, types and factors affecting Validity. Characteristics of a good Psychological Test
Construct vs Concept vs Variable

1. CONCEPT The word ‘concept’ refers to a generalized idea of something, which helps us to
understand the category and diversity of related real-world phenomena. Concepts are based on
our experiences and can be based on real phenomena. Examples of concepts include common
demographic measures: Income, Age, Education Level, Number of Siblings.
2. CONSTRUCT The word ‘construct’ means a focused abstract idea inferred
from an observable phenomenon.
3. VARIABLE A ‘variable’ is a factor or aspect of an issue, incident or content that
can be measured. A variable takes on values, which vary from incident to
incident and issue to issue. For example, if we are conducting research on the
present condition of a village, the demographic profile, economic condition, and health and
hygiene status could be considered as variables. Variables are measurements that are free
to vary. Variables can be divided into independent variables and dependent variables. A
dependent variable changes in response to changes in the independent variable or variables.
Reliability has 2 aspects

Temporal Consistency- It is consistency over time.
• Test-Retest Reliability
• Alternate-forms reliability
Inter item consistency- The degree to which every test item measures the same construct.
• Split-Half reliability
• Coefficient Alpha
• Kuder-Richardson
• Interscorer Reliability
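Two of the internal-consistency coefficients listed above can be computed directly from an item-by-respondent score table. The sketch below uses hypothetical right/wrong (1/0) data and population variances throughout; under those assumptions coefficient alpha and the Kuder-Richardson formula 20 (KR-20) give the same value for dichotomous items.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Coefficient alpha; scores is a list of respondents, each a list of item scores."""
    k = len(scores[0])  # number of items
    item_variances = [pvariance([person[i] for person in scores]) for i in range(k)]
    total_variance = pvariance([sum(person) for person in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def kr20(scores):
    """Kuder-Richardson formula 20, for items scored 0 or 1."""
    k = len(scores[0])
    n = len(scores)
    pq_sum = 0.0
    for i in range(k):
        p = sum(person[i] for person in scores) / n  # proportion answering correctly
        pq_sum += p * (1 - p)
    total_variance = pvariance([sum(person) for person in scores])
    return (k / (k - 1)) * (1 - pq_sum / total_variance)


# Hypothetical responses: five people, four right/wrong items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(responses)
kr = kr20(responses)
print(round(alpha, 3), round(kr, 3))
```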
Factors influencing the reliability of test scores:
There are some intrinsic and extrinsic factors which affect the reliability of the test scores:
1. Length of the test: The reliability of the test increases with its length.
2. Speed: In a speed test, reliability will be problematic. This is because not every student
can complete all of the items in a speed test. In contrast, a power test is a test in
which every student is able to complete all the items.
3. Group Homogeneity: The test is more reliable if the group of students to which the
test is administered is more heterogeneous.
4. Item Difficulty: The test items should have a certain difficulty level so as to maintain
the reliability of the test, i.e. the items of the test should not be very easy or very
hard.
5. Objectivity: Objective test will have higher reliability compared to subjective test.
6. Variation with the testing situation: Deviation during the administration of the test
such as noise level and distraction can cause test scores to vary, which may affect
the reliability of the test.
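The effect of test length (factor 1 above) is commonly estimated with the Spearman-Brown prophecy formula, which is not named in these notes and is brought in here as an added illustration. It assumes the added items are comparable to the existing ones; the starting reliability of 0.60 is hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened by length_factor.

    length_factor = 2 means the test is doubled; 0.5 means it is halved.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


current_reliability = 0.60  # hypothetical reliability of the existing test

doubled = spearman_brown(current_reliability, 2)
halved = spearman_brown(current_reliability, 0.5)

print(round(doubled, 3))
print(round(halved, 3))
```

The same formula is what steps a split-half correlation up to an estimate for the full-length test, with a length factor of 2.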
Characteristics of a good Psychological Test

Five main characteristics of a good psychological test are as follows: 1. Objectivity 2.
Reliability 3. Validity 4. Norms 5. Practicability.
1. Objectivity: The test should be free from subjective judgement regarding the ability,
skill, knowledge, trait or potentiality to be measured and evaluated.
2. Reliability: This refers to the extent to which the obtained results are consistent or
reliable. When the test is administered to the same sample more than once with a
reasonable gap of time, a reliable test will yield the same scores. It means the test is
trustworthy. There are many methods of testing the reliability of a test.
3. Validity: It refers to the extent to which the test measures what it intends to measure. For
example, when an intelligence test is developed to assess the level of intelligence, it should
assess the intelligence of the person, not other factors. Validity tells us whether the test
fulfils the objective of its development. There are many methods to assess the validity of a test.
4. Norms: Norms refer to the average performance of a representative sample on a given
test. They give a picture of the average standard of a particular sample in a particular aspect.
Norms are the standard scores developed by the person who develops the test. Future
users of the test can compare their scores with the norms to know the level of their sample.
5. Practicability: The test must be practicable in terms of the time required for completion, the
length, the number of items or questions, scoring, etc. The test should not be too lengthy or too
difficult to answer or to score.
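Comparing a score with norms (characteristic 4) usually means converting it to a standard score. A minimal sketch, in which the norm-group mean and standard deviation are hypothetical and the 100/15 scale is the deviation-IQ convention, used here only as an example target scale:

```python
def z_score(raw, norm_mean, norm_sd):
    """Number of standard deviations a raw score lies above or below the norm mean."""
    return (raw - norm_mean) / norm_sd


def scaled_score(raw, norm_mean, norm_sd, scale_mean=100.0, scale_sd=15.0):
    """Re-express a raw score on a scale with a chosen mean and standard deviation."""
    return scale_mean + scale_sd * z_score(raw, norm_mean, norm_sd)


# Hypothetical norms for a test: mean raw score 50, standard deviation 10.
NORM_MEAN, NORM_SD = 50.0, 10.0

print(z_score(60, NORM_MEAN, NORM_SD))
print(scaled_score(60, NORM_MEAN, NORM_SD))
print(scaled_score(35, NORM_MEAN, NORM_SD))
```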
