Website: peterhalpin.github.io/RDIF-workshop/
Slides: These slides in HTML format
Notes: These slides in DOCX format (translated, editable)
Code: Just the code from these slides
Psychometric models posit two non-mutually exclusive explanations of why the distribution of test scores may differ over groups of respondents
Impact: the groups differ on the trait being measured
DIF: the measure is biased with respect to group membership
The goal of DIF analysis is to detect biased items without making assumptions about impact
semTools::partialInvariance()
In IRT, usually the letter “\(\theta\)” denotes the latent trait (ability)
Like factor analysis, we usually assume \(\theta \sim N(0, 1)\) when considering a single group or population
It is customary to present IRT models in terms of the measurement model for each item: \(p(X_j \mid \theta)\)
For binary data:
\[ P_j(\theta) = \text{Prob}(X_j = 1 \mid \theta)\]
This is called the item response function (IRF)
It describes how the probability of endorsing an item depends on the level of the trait being measured
\[\begin{equation} P_j(\theta) = \frac{\exp(a_j (\theta - b_j))} {1 + \exp(a_j (\theta - b_j))} \end{equation}\]
The parameter \(b_j\) is called the item difficulty
Note that \(\theta = b_j\) implies
\[ P_j(b_j) = \frac{\exp(a_j (0))} {1 + \exp(a_j (0))} = 1/2 \]
So, item difficulty is the value of \(\theta\) at which the probability of endorsing the item is equal to 1/2
Respondents with ability above the difficulty level of the item have probability > 1/2 of answering the item correctly, and conversely
The parameter \(a_j > 0\) is called the item discrimination
Items with higher discrimination have steeper slopes and a stronger relationship to the latent trait:
\[ \frac{\partial}{\partial \theta} P_j(\theta) = a_j P_j(\theta) Q_j(\theta), \quad \text{where } Q_j(\theta) = 1 - P_j(\theta) \]
\[ \frac{\partial}{\partial \theta} P_j(b_j) = a_j /4 \]
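A quick numerical check (a minimal sketch in base R with assumed parameter values, not code from the workshop) confirms that \(P_j(b_j) = 1/2\) and that the slope at \(b_j\) is \(a_j/4\):

# Sketch: evaluate the 2PL IRF for one hypothetical item
a <- 1.5                                         # assumed discrimination
b <- 0.5                                         # assumed difficulty
irf <- function(theta) plogis(a * (theta - b))   # P_j(theta) for the 2PL
irf(b)                                           # 0.5 at theta = b
a * irf(b) * (1 - irf(b))                        # slope at theta = b, equals a/4 = 0.375
curve(irf, from = -4, to = 4, xlab = expression(theta), ylab = "P(X = 1)")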
Let \(\hat \theta\) denote the maximum likelihood estimate (MLE) of \(\theta\)
In practical terms, \(\hat \theta\) is the “best” estimate of the trait we can obtain from an assessment
The standard error of the MLE, \(SE(\hat\theta)\), describes how precisely we can estimate \(\theta\)
Information is defined as the precision of the MLE, \(1/SE(\hat\theta)^2\)
One of the main contributions of IRT is to model how information depends on the parameters of test items
This provides a good theory for test development! It tells us how to build tests with a desired level of precision / information
We address information here for completeness but it is not required for DIF analysis
The item information function (IIF) is the precision that results from estimating the latent trait using a single item
In practice, we would never use only a single item on a test
But, we can build up the information function of the entire test from that of each individual item
So, we start with the IIF and then use that to get the test information function (TIF)
For the 2PL, the IIF is:
\[\begin{equation} I_j(\theta) = a_j^2 P_j(\theta) Q_j(\theta) \end{equation}\]
The TIF is obtained by summing the information functions of all of the items on a test
\[ I(\theta) = \sum_{j = 1}^{J} I_j(\theta) \]
Reliability at a given level of \(\theta\) can be obtained from the TIF (assuming the latent trait has unit variance):
\[R(\theta) = \frac{I(\theta)}{1 + I(\theta)}\]
Marginal reliability of the total score:
[1] 0.8148199
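As a sketch, quantities like these can be computed from a fitted mirt model with the helpers testinfo() and marginal_rxx(); the grm object fitted later in these slides is used here for illustration, and the model behind the figure above is not specified, so the value may differ:

# Sketch: test information, conditional reliability, and marginal reliability in mirt
theta <- seq(-4, 4, length.out = 201)    # grid of trait values
tif <- testinfo(grm, matrix(theta))      # test information I(theta)
rel <- tif / (1 + tif)                   # conditional reliability R(theta)
plot(theta, rel, type = "l", xlab = expression(theta), ylab = "Reliability")
marginal_rxx(grm)                        # marginal reliability of the trait estimate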
A central concept in IRT is (Fisher) information, which is the precision with which the target construct can be estimated
Item information functions describe the information provided by each item
Test information is the sum of the items’ information
Reliability of the total score can be derived from the test information
Information is useful for comparing different items / tests but reliability is easier to interpret
Let the item response \(X_j\) take on values \(c \in \{0, 1, \dots, C-1\}\), where \(C\) is the number of response categories for the item
It doesn’t matter what we label the categories as long as they are ordered
For CINT: never (0), rarely (1), sometimes (2), or almost always (3)
The cumulative response function is the probability of endorsing category \(c\) or higher, conditional on \(\theta\) (defined for \(c = 1, \dots, C-1\))
\[ P_{jc}(\theta) = \text{Prob} (X_j \geq c \mid \theta) \]
\[\text{Prob} (X_j = c \mid \theta) = P_{jc}(\theta) - P_{j, c+1}(\theta), \quad \text{with } P_{j0}(\theta) = 1 \text{ and } P_{jC}(\theta) = 0\]
\[ P_{jc}(\theta) = \frac{\exp(a_j (\theta - b_{jc}))} {1 + \exp(a_j (\theta - b_{jc}))} \]
Each item has only one discrimination (proportional odds assumption)
Each response category has its own difficulty parameter, now called a “threshold” parameter
\[\begin{align} \text{Prob} (X_j = 0 \mid \theta) & = 1 - P_{j1}(\theta) \\ \text{Prob} (X_j = 1 \mid \theta) & = P_{j1}(\theta) - P_{j2}(\theta) \\ ... \\ \text{Prob} (X_j = C-1 \mid \theta) & = P_{j,C-1}(\theta) - 0 \end{align}\]
These give the probability of endorsing each category
Note that this reduces to 2PL for \(C = 2\)
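A minimal sketch in base R (parameter values assumed, loosely based on the cint1 row of the coefficient table shown later) of how the category probabilities and the expected item score follow from the cumulative response functions:

# Sketch: GRM category probabilities for one hypothetical item with C = 4 categories
a <- 1.6                                   # assumed discrimination
b <- c(-1.6, 0.0, 1.6)                     # assumed thresholds b_1, b_2, b_3
theta <- 1                                 # evaluate at a single trait value
cum <- c(1, plogis(a * (theta - b)), 0)    # P_j0 = 1, then P_j1, P_j2, P_j3, then P_j4 = 0
probs <- cum[1:4] - cum[2:5]               # Prob(X = 0), ..., Prob(X = 3)
sum(probs)                                 # probabilities sum to 1
sum((0:3) * probs)                         # expected item score at this theta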
library(mirt)

# Load data and separate depression items
cint <- read.csv("cint_data.csv")
depression_names <- c("cint1", "cint2", "cint4", "cint11",
                      "cint27", "cint28", "cint29", "cint30")
depression_items <- cint[, depression_names]

# Run GRM model
grm <- mirt(depression_items,
            itemtype = "graded")

# Per-item plots
itemplot(grm,
         item = 1,
         type = "threshold",
         main = "Cumulative response functions")
itemplot(grm,
         item = 1,
         type = "trace",
         main = "Category response functions")
# Plotting all test items
plot(grm,
     type = "itemscore",
     main = "Expected item score functions",
     facet_items = FALSE)
Plots for CINT.1 (“Feels sad or depressed”)
The cumulative response functions (top) are not usually reported; they are shown here to illustrate the 2PL assumption
The ICRFs (bottom) are usually reported; they show the probability of endorsing each category
Values of item thresholds are shown by dashed vertical lines
Plotting the expected item score is a way of simplifying presentation of the entire assessment
It shows how the expected response (0-3) depends on the measured trait
Computed as \(\sum_c c\times \text{Prob} (X_j = c \mid \theta)\)
Curves further to the left correspond to items that were “easier”, in the sense that higher item scores are expected at lower levels of depression
$items
a b1 b2 b3
cint1 1.612 -1.622 -0.011 1.578
cint2 1.286 -1.049 0.306 1.665
cint4 1.129 -1.879 -0.195 1.758
cint11 1.219 -1.688 -0.402 1.608
cint27 1.692 -0.594 0.279 1.404
cint28 1.213 -1.292 -0.001 1.641
cint29 1.351 -0.004 0.889 2.217
cint30 1.155 -0.630 0.435 1.910
$means
F1
0
$cov
F1
F1 1
Usually, plots rather than coefficient tables are presented to summarize the models
IIFs, TIFs, and reliability are provided in the Appendix
GRM is widely used for ordered categorical responses
The cumulative response functions are modeled using 2PL
The ICRFs are derived from the cumulative response functions
The goal of DIF analysis is to detect biased items without making assumptions about impact
So, we want to test whether item parameters differ over groups
There are lots of ways to do this, but for the example we will focus on the likelihood ratio test for nested models
If an item exhibits DIF, it should be investigated (e.g., revised or omitted)
Sounds simple enough, but…
The more obvious problem: inferring whether item parameters differ as a function of some external variable(s)
For illustrative purposes, consider Lord’s (Wald) test for the difficulty parameter of the 2PL in groups \(g = 0, 1\)
\[ z_j = \frac{\hat b_{1j} - \hat b_{0j}} {\sqrt{\text{var}(\hat b_{1j}) + \text{var}(\hat b_{0j})}}\]
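A minimal sketch of the computation, using made-up estimates and standard errors (assumptions for illustration, not results from the CINT data):

# Sketch: Lord's z statistic for one item's difficulty, with hypothetical estimates
b0 <- 0.40; se_b0 <- 0.10    # assumed estimate and SE in group 0
b1 <- 0.75; se_b1 <- 0.12    # assumed estimate and SE in group 1
z <- (b1 - b0) / sqrt(se_b0^2 + se_b1^2)
2 * pnorm(-abs(z))           # two-sided p-value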
The less obvious problem: IRT models are identified only up to a linear transformation of the latent trait
This means that the item parameters and latent trait can be linearly transformed without changing the IRFs
Let \(\theta^* = A\theta + B\), \(b^*_j = A b_j + B\), and \(a^*_j = a_j/A\):
\[\begin{align} \text{logit} (P_j(\theta)) & = a^*_j(\theta^* - b^*_j) \\ & = \frac{a_j}{A}(A\theta + B - (A b_j + B)) \\ & = a_j(\theta - b_j) \end{align}\]
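This can be verified numerically (a minimal sketch with arbitrary assumed values; not code from the workshop):

# Sketch: the rescaled parameters produce exactly the same IRF
a <- 1.5; b <- 0.5; theta <- seq(-3, 3, by = 0.5)    # assumed 2PL parameters and trait values
A <- 2; B <- -1                                      # arbitrary linear transformation
logit_orig  <- a * (theta - b)
logit_trans <- (a / A) * ((A * theta + B) - (A * b + B))
all.equal(logit_orig, logit_trans)                   # TRUE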
This is the technical reason that we need to set the scale of the latent trait when estimating psychometric models
Setting \(\theta \sim N(0, 1)\) implies that \(A = 1\) and \(B = 0\), which solves the problem
If we scale the latent trait to have mean \(\mu_g\) and standard deviation \(\sigma_g\) in group \(g\), this has implications for testing model parameters
The scale transformations are
\[\begin{equation} \notag \theta^*_1 = \sigma_0 \left(\frac{\theta_1 - \mu_1}{\sigma_1}\right) + \mu_0 \quad \text{and} \quad b^*_{1i} = \sigma_0 \left(\frac{b_{1i} - \mu_1}{\sigma_1}\right) + \mu_0 \end{equation}\]
Plugging the rescaled item parameters into Lord’s test
\[\begin{align} \label{dstar}\notag z^*_i = \frac{\hat b^*_{1i} - \hat b_{0i}} {\sqrt{\text{var}(\hat b^*_{1i}) + \text{var}(\hat b_{0i})}} = \frac{\frac{\sigma_0}{\sigma_1} \left(\hat b_{1i} - \mu_1 \right) + \mu_0 - \hat b_{0i}} {\sqrt{\frac{\sigma^2_0}{\sigma^2_1} \text{var}(\hat b_{1i}) + \text{var}(\hat b_{0i})}} \end{align}\]
Conclusion: if there is impact on either the mean or the variance of the latent trait, Lord’s test is biased
The problem just described has been referred to as the circular nature of DIF (Angoff, 1982)
We want to compare the value of model parameters over groups
To do this we must scale the latent trait in both groups
To scale the latent trait, we must assume some model parameters are equal over groups
But this is what we wanted to test in the first place!
In practice, the problem is resolved by choosing an “anchor set” of items
Anchors are items that we treat as DIF-free when testing other items for DIF
There are many strategies, heuristics, etc. for choosing anchors
These are all flawed – anchor item selection is a limitation of traditional methods for DIF analysis
One widely used approach to choosing anchors is called purification and refinement
Stage 1. Test each item assuming every other item is an anchor. Before starting stage two, remove any item with DIF from the anchor set (“purification”)
Stage 2. Test the items without DIF again, using the purified anchor set
Can repeat as desired
The procedure is exploratory: it involves many tests of DIF, fitting models under different restrictions, …
The LR test uses a multi-group IRT model to test whether the parameters of an item differ over groups
Write the 2PL in two groups as follows:
\[\begin{align} \text{Reference group: } & \text{logit} (P_{j0}(\theta)) = a_{j0}(\theta - b_{j0}) \quad \text{ with } \theta \sim N(0, 1) \\ \text{Comparison group: } & \text{logit} (P_{j1}(\theta)) = a_{j1}(\theta - b_{j1}) \quad \text{ with } \theta \sim N(\mu, \sigma) \\ \end{align}\]
The second subscript on the IRFs and item parameters indicates the reference group (0) or the comparison group (1)
In the reference group, we scale the latent trait arbitrarily
In the comparison group, we estimate the mean and the variance of the latent trait
In order to apply the LR test, we estimate the following two models
Model 1: The nested (smaller) model is obtained by setting all item parameters equal across groups
\[a_{j0} = a_{j1} = a_{j} \quad \text{ and } \quad b_{j0} = b_{j1} = b_j \quad \text{for all } \quad j = 1 \dots J\]
Same as strong invariance in factor analysis
Model 2: The nesting (larger) model is obtained by allowing the parameters of the focal item to vary across groups:
\[a_{j0} \neq a_{j1} \quad \text{ and } \quad b_{j0} \neq b_{j1} \quad \text{for the focal item } j^* \]
Note that we are not requiring the parameters to be unequal; they may be equal or unequal, and we simply allow them to be estimated freely in each group
Software automates the fitting of these item-by-item models
The LR test then proceeds by comparing the likelihood of the nested model to that of the nesting model
When the constraints imposed by the nested model are valid (i.e., if there is no DIF on the item), the test statistic has a chi-square distribution with degrees of freedom equal to the number of constrained parameters
If the LR test of DIF is significant, we conclude that the item is biased
If not, then we conclude that the item is not biased
# Groups need to be a factor
gender <- factor(cint$cfemale)

# Invariance constraints used by mirt
strong.invariance <- c("free_mean", "free_var", "slopes", "intercepts")

# Estimate model (can request SEs using SE = TRUE)
strong.mod <- multipleGroup(depression_items,
                            group = gender,
                            itemtype = "graded",
                            invariance = strong.invariance)

# View output
coef(strong.mod, IRTpars = TRUE, simplify = TRUE)

DIF(strong.mod,
    which.par = c("a1", "d1", "d2", "d3"), # <- mirt notation
    scheme = "drop")                       # <- drop item constraints
groups converged AIC SABIC HQ BIC X2 df p
cint1 0,1 TRUE 3.113 9.344 10.370 22.047 4.887 4 0.299
cint2 0,1 TRUE 4.692 10.923 11.949 23.626 3.308 4 0.508
cint4 0,1 TRUE 3.169 9.400 10.425 22.102 4.831 4 0.305
cint11 0,1 TRUE 1.948 8.178 9.204 20.881 6.052 4 0.195
cint27 0,1 TRUE 0.629 6.860 7.886 19.563 7.371 4 0.118
cint28 0,1 TRUE 3.090 9.321 10.347 22.024 4.91 4 0.297
cint29 0,1 TRUE -23.411 -17.180 -16.154 -4.477 31.411 4 0
cint30 0,1 TRUE -9.195 -2.964 -1.939 9.738 17.195 4 0.002
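As a sketch of what DIF() automates for each row (illustrated for cint29; the anchor_items object is introduced here only for this example), the same comparison can be run by hand with a second multipleGroup() fit and anova():

# Sketch: the LR test for cint29 "by hand"; should approximately reproduce the cint29 row above
anchor_items <- setdiff(depression_names, "cint29")   # treat the other items as anchors
free.mod <- multipleGroup(depression_items,
                          group = gender,
                          itemtype = "graded",
                          invariance = c("free_mean", "free_var", anchor_items))
anova(strong.mod, free.mod)   # LR test, df = number of parameters freed for cint29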
DIF(strong.mod,
    which.par = c("a1", "d1", "d2", "d3"),
    scheme = "drop_sequential", # <- different scheme
    seq_stat = .05,             # <- Type I error rate for DIF
    max_run = 2)                # <- two stages only
Checking for DIF in 6 more items
Computing final DIF estimates...
groups converged AIC SABIC HQ BIC X2 df p
cint29 0,1 TRUE -18.863 -12.632 -11.606 0.071 26.863 4 0
cint30 0,1 TRUE -4.647 1.584 2.610 14.286 12.647 4 0.013
DIF analysis identified two items that were biased with respect to gender
More questions: In which direction is the DIF? Does allowing for DIF change conclusions about impact (gender differences in depression)?
One way to investigate these questions:
# Invariance constraints (anchor items treated as DIF-free)
partial.invariance <- c("free_mean", "free_var",
                        "cint1", "cint2", "cint4", "cint27", "cint28")

# Estimate model
partial.mod <- multipleGroup(depression_items,
                             group = gender,
                             itemtype = "graded",
                             invariance = partial.invariance)

# Plot expected item score functions of the biased items
itemplot(partial.mod, type = "score", item = "cint29", main = "CINT 29")
itemplot(partial.mod, type = "score", item = "cint30", main = "CINT 30")

# Examine parameter estimates
coef(partial.mod, IRTpars = TRUE, simplify = TRUE)
DIF analysis identified two items that were biased with respect to gender
Females were expected to report higher scores than males, even at the same level of depression
Gender differences in depression changed when the items with DIF were allowed to vary over groups (partial invariance); a comparison of the estimated group parameters is sketched below
A limitation of current methods is that we cannot directly test whether DIF affects conclusions about impact
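A minimal sketch of how this could be inspected informally with the models fitted above, by comparing the estimated latent means and variances under full versus partial invariance:

# Sketch: does freeing the DIF items change the estimated impact?
coef(strong.mod, simplify = TRUE)    # all items constrained; see $means and $cov per group
coef(partial.mod, simplify = TRUE)   # cint29 and cint30 freed; compare the group means/variances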
We have seen how to test for DIF using IRT models (the 2PL and GRM)
In our example, we found two items biased with respect to gender
We discussed two limitations of DIF analysis
New procedures for addressing DIF and DTF
Do not require selection of anchor items
Guaranteed to work if < 50% of items exhibit DIF
Can be used to test for whether DIF affects impact (without having to first test for DIF in each item!)
Easy to implement
Angoff, W. (1982). Use of difficulty and discrimination indices for detecting item bias. In R. Berk (Ed.), Handbook of Methods for Detecting Test Bias (pp. 96–116). The Johns Hopkins Press.
Nicewander, W. A. (2018). Conditional reliability coefficients for test scores. Psychological Methods, 23(2), 351–362. https://doi.org/10.1037/met0000132
Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential Item Functioning (pp. 67–113). Lawrence Erlbaum Associates.
item S_X2 df.S_X2 RMSEA.S_X2 p.S_X2
1 cint1 37.080 39 0.000 0.558
2 cint2 51.364 44 0.014 0.207
3 cint4 40.499 43 0.000 0.580
4 cint11 59.168 43 0.021 0.051
5 cint27 81.114 40 0.035 0.000
6 cint28 41.579 43 0.000 0.533
7 cint29 55.568 43 0.019 0.095
8 cint30 39.844 46 0.000 0.727