Overview
Item Response Theory (IRT) designates a body of related psychometric theory that provides a foundation for stating and testing the hypothesis that, upon interaction between a given person and assessment item, the probability of a discrete item response is governed by the location of the item and person with respect to a latent trait. For this reason, IRT may be regarded as roughly synonymous with Latent Trait Theory. The term latent is used to emphasise that discrete item responses are taken to be observable manifestations of the trait or attribute, the existence of which is hypothesized and must be inferred from the manifest responses. In addition to person and item locations, it is often posited that responses by persons to particular items are influenced by other properties, such as item discrimination.
IRT models are used as a basis for statistical estimation of parameters that represent the "locations" of persons and items on a latent continuum or, more correctly, the magnitude of the latent trait attributable to the persons and items. For example, in attainment testing, estimates may be of the magnitude of a person's ability within a specific domain, such as reading comprehension. Once estimates of relevant parameters have been obtained, statistical tests are usually conducted to gauge the extent to which the parameters predict item responses. Stated somewhat differently, such tests are used to ascertain the degree to which the parameters account for the structure of and statistical patterns within the response data, either as a whole, or by considering specific subsets of the data such as response vectors pertaining to individual items or persons. This approach permits the central hypothesis to be subjected to empirical testing, as well as providing information about the psychometric properties of a given assessment, and therefore also the quality of estimates.
From the perspective of more traditional approaches, such as Classical test theory, an advantage of IRT is that it potentially provides information that enables a researcher to improve the reliability of an assessment. IRT is sometimes referred to as strong true score theory or modern mental test theory because it is a more recent body of theory and makes explicit the hypotheses that remain implicit within Classical test theory.
IRT models
Much of the literature on IRT centers around item response models. A given IRT model constitutes a mathematized hypothesis that the probability of a discrete response to an item is a function of a person parameter (or, in the case of multidimensional item response theory, a vector of person parameters) and one or more item parameters. For example, in the 3 Parameter Logistic Model (3PLM), the probability of a correct response to an item i is:

p_i(\theta) = c_i + \frac{1 - c_i}{1 + e^{-D a_i (\theta - b_i)}}

where \theta is the person parameter and a_i, b_i, and c_i are item parameters. The item parameter a_i is a measure of the discriminating power of the item, similar to the classical point-biserial correlation index. The parameter b_i is a measure of the item's difficulty, and for items, such as multiple choice, where there is a chance of guessing the correct answer, there is the lower-asymptote parameter c_i.
This logistic model relates the level of the person parameter and the item parameters to the probability of responding correctly. The constant D has the value 1.702, which rescales the logistic function to closely approximate the cumulative normal ogive. (This model was originally developed using the normal ogive, but the logistic model with the rescaling provides virtually the same model while simplifying the computations greatly.)
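As an informal illustration (not part of the original presentation; the function name and parameter values below are chosen only for demonstration), a short Python sketch of the 3PL response probability might look like this:

import math

D = 1.702  # scaling constant that makes the logistic closely match the normal ogive

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response:
    c + (1 - c) / (1 + exp(-D * a * (theta - b)))."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# Example: an item of moderate difficulty (b = 0), good discrimination (a = 1.2),
# and a guessing floor of 0.2 (e.g., a five-option multiple-choice item).
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(p_3pl(theta, a=1.2, b=0.0, c=0.2), 3))

Evaluating the function at several trait levels in this way traces out the item characteristic curve discussed next.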
The line that traces the probability for a given item across levels of the trait is called the item characteristic curve (ICC) or, less commonly, item response function.
The person parameter indicates the individual's standing on the latent trait. The estimate of the person parameter is the individual's test score. The latent trait is the human capacity measured by the test. It might be a cognitive ability, physical ability, skill, knowledge level, attitude, personality characteristic, etc. In a unidimensional model such as the one above, this trait is considered to be a single factor (as in factor analysis). Individual items or individuals might have secondary factors, but these are assumed to be mutually independent and collectively orthogonal.
The item parameters simply determine the shape of the ICC and in some cases may not have a direct interpretation. In this case, however, the parameters are commonly interpreted as follows. The b parameter is considered to index an item's difficulty. Note that this model scales the item's difficulty and the person's trait onto the same metric. Thus, it is valid to talk about an item being about as hard as Person A's trait level, or of a person's trait level being about the same as Item Y's difficulty. The a parameter controls how steeply the ICC rises and thus indicates the degree to which the item distinguishes individuals with trait levels above and below the rising slope of the ICC. This parameter is thus called the item discrimination and is correlated with the item's loading on the underlying factor, with the item-total correlation, and with the index of discrimination. The final parameter, c, is the lower asymptote of the ICC on the left-hand side. Thus it indicates the probability that very low ability individuals will answer this item correctly by chance.
This model assumes a single trait dimension and a binary outcome; it is a dichotomous, unidimensional model. Another class of models predicts polytomous outcomes, and a further class of models exists to predict response data that arise from multiple traits.
Information
One of the major contributions of item response theory is the extension of the concept of reliability. Traditionally, reliability refers to the precision of measurement (i.e., the degree to which measurement is free of error), and it is measured using a single index defined in various ways, such as the ratio of true and observed score variance. This index is helpful in characterizing a test's average reliability, for example in order to compare two tests. But it is clear that reliability cannot be uniform across the entire range of test scores. Scores at the edges of the test's range, for example, are known to have more error than scores closer to the middle.
Item response theory advances the concept of item and test information to replace reliability. Information is also a function of the model parameters. The item information supplied by the one-parameter Rasch model is simply the probability of a correct response multiplied by the probability of an incorrect response, or

I(\theta) = p_i(\theta)\,(1 - p_i(\theta))

The standard error of estimation (SE) is the reciprocal of the square root of the test information at a given trait level:

SE(\theta) = \frac{1}{\sqrt{I(\theta)}}
Thus more information implies less error of measurement.
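A minimal Python sketch of these two quantities, assuming the standard Rasch response function p_i(θ) = 1 / (1 + e^{-(θ - b_i)}) (the function names and the example difficulty are illustrative, not from the article):

import math

def p_rasch(theta, b):
    """Rasch (one-parameter) probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def rasch_information(theta, b):
    """Item information: probability of a correct response times probability of an incorrect one."""
    p = p_rasch(theta, b)
    return p * (1.0 - p)

# Information peaks where theta equals the item difficulty and falls off on either side,
# so the standard error 1/sqrt(information) is smallest near the item's difficulty.
for theta in (-2.0, 0.0, 2.0):
    info = rasch_information(theta, b=0.0)
    print(theta, round(info, 3), round(1.0 / math.sqrt(info), 3))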
For other models, such as the two- and three-parameter models, the discrimination parameter plays an important part in the function. The item information function for the two-parameter model is

I(\theta) = a_i^2\, p_i(\theta)\,(1 - p_i(\theta))
In general, item information functions tend to look "bell-shaped." Highly discriminating items have tall, narrow information functions; they contribute greatly but over a narrow range. Less discriminating items provide less information but over a wider range.
Plots of item information can be used to see how much information an item contributes and to what portion of the scale score range. Because of local independence, item information functions are additive. Thus, the test information function is simply the sum of the information functions of the items on the exam. Using this property with a large item bank, test information functions can be shaped to control measurement error very precisely.
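For instance, a brief Python sketch (the item bank and parameter values are invented purely for illustration) of how 2PL item information functions add up to a test information function and its standard error:

import math

def p_2pl(theta, a, b):
    """Two-parameter logistic probability of a correct response (no D scaling here)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information_2pl(theta, a, b):
    """2PL item information: a^2 * p * (1 - p)."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

# A small illustrative item bank as (a, b) pairs: discrimination and difficulty.
bank = [(1.5, -1.0), (0.8, 0.0), (1.2, 0.5), (2.0, 1.0)]

def test_information(theta):
    """Because of local independence, test information is the sum of the item information functions."""
    return sum(item_information_2pl(theta, a, b) for a, b in bank)

for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    info = test_information(theta)
    print(theta, round(info, 2), round(1.0 / math.sqrt(info), 2))  # information and standard error

Selecting items so that this summed curve is tall over the trait range of interest is how measurement error can be controlled when assembling a test from a large item bank.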
Estimation
A comparison of classical and modern test theory
Scoring
After the model is fit to data, each person has a theta estimate, which is that person's score on the exam. This "IRT score" is computed and interpreted in a very different manner from traditional scores such as number correct or percent correct. However, for most tests, the (linear) correlation between the theta estimate and a traditional score is very high (e.g., .95). A graph of IRT scores against traditional scores shows an ogive shape, implying that the IRT score is somewhat better at separating individuals with low or high trait standing.
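As a rough illustration of how a theta estimate can be obtained once item parameters are known (a simplified maximum-likelihood grid search under a 2PL model; the item parameters, response pattern, and function names are invented for this example, and operational scoring programs use more refined estimation methods):

import math

def p_2pl(theta, a, b):
    """Two-parameter logistic probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, responses, items):
    """Log-likelihood of a 0/1 response pattern given theta and known item parameters."""
    ll = 0.0
    for x, (a, b) in zip(responses, items):
        p = p_2pl(theta, a, b)
        ll += math.log(p) if x == 1 else math.log(1.0 - p)
    return ll

# Illustrative item parameters (a, b) and one person's response pattern.
items = [(1.0, -1.5), (1.2, -0.5), (0.9, 0.0), (1.5, 0.5), (1.1, 1.5)]
responses = [1, 1, 1, 0, 0]

# Crude grid search for the theta that maximizes the likelihood.
grid = [i / 100.0 for i in range(-400, 401)]
theta_hat = max(grid, key=lambda t: log_likelihood(t, responses, items))
print(round(theta_hat, 2))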
It is worth noting the implications of IRT for test-takers. Tests are imprecise tools and the score achieved by an individual (the observed score) is always the true score occluded by some degree of error. This error may push the observed score higher or lower.
Also, nothing about these models refutes human development or improvement. A person may learn skills, knowledge, or even so-called "test-taking skills" that may translate to a higher true score.
See also psychometrics, standardized test, classical test theory
A brief list of references
Many books have been written that address item response theory or contain IRT or IRT-like models. This is a partial list, focusing on texts that provide more depth.
- Lord, F.M. (1980). Applications of item response theory to practical testing problems. Mahwah, NJ: Erlbaum.
This book summarizes much of Lord's IRT work, including chapters on the relationship between IRT and classical methods, fundamentals of IRT, estimation, and several advanced topics. Its estimation chapter is now dated in that it primarily discusses the joint maximum likelihood method rather than the marginal maximum likelihood method implemented by Darrell Bock and his colleagues.
- Embretson, S. and Reise, S. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.
This book is an accessible introduction to IRT, aimed, as the title says, at psychologists.