Website: peterhalpin.github.io/RDIF-workshop/
Slides: These slides in HTML format
Notes: These slides in DOCX format (translated, editable)
Code: Just the code from these slides
Test scores: Observed scores vs IRT-based scores
Models for IRT-based scores: Concurrent calibration (multi-group models) vs separate calibration (separate models)
Research design: equivalent vs non-equivalent groups, different or overlapping test items
Scaling functions are used to compute scaling parameters using item parameters
e.g., “mean-mean” scaling for 2PL model
There are many scaling functions; we focus on “moment-based” functions of this kind
See Appendix for more details
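As a concrete illustration, a minimal mean-mean scaling function for the 2PL might look like the following sketch. The function name, argument names, and the direction-of-transformation convention are assumptions for illustration, not the package's implementation:

```r
# Illustrative mean-mean scaling for the 2PL in a/b parameterization.
# Convention assumed here: theta.to = A * theta.from + B places the "from"
# scale onto the "to" scale, so that for DIF-free items
# a.to = a.from / A and b.to = A * b.from + B.
mean_mean <- function(a.from, b.from, a.to, b.to) {
  A <- mean(a.from) / mean(a.to)        # slope of the scaling function
  B <- mean(b.to) - A * mean(b.from)    # intercept of the scaling function
  c(A = A, B = B)
}

# Example: group-2 parameters generated by an exact transformation (A = 1.2, B = 0.5)
a1 <- c(1.0, 1.5, 0.8)
b1 <- c(-0.5, 0.0, 0.7)
mean_mean(a1, b1, a.to = a1 / 1.2, b.to = 1.2 * b1 + 0.5)  # recovers A = 1.2, B = 0.5
```

With no DIF, the item-level moments identify the scaling parameters exactly; DIF in some items perturbs the means, which motivates the robust alternative discussed next.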
“Out of the box” robust regression doesn’t work very well for this problem
The regression model misses some peculiar aspects of the scaling problem
\[ \Psi(\mu) = \sum_i \psi\left(\frac{Y_i - \mu}{V[Y_i]}\right) = 0\]
\[ \hat\mu = \frac{\sum_i w_i Y_i}{\sum_i w_i} \quad \text{where} \quad z_i = \frac{Y_i - \mu}{V[Y_i]}\]
\[ w_i = \left\{\begin{array}{ccc} \left(1 - \left( \frac{z_i}{k} \right)^2\right)^2 & \text{ for } & {\mid z_i \mid } \leq k \\ 0 & \text{ for } &{\mid z_i \mid } > k \\ \end{array} \right. \]
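These equations can be sketched as an iteratively reweighted mean. This is an illustrative implementation, not the package's code; the tuning constant `k`, the starting value, and the convergence settings are all assumptions:

```r
# Bisquare (Tukey) weight function from the display above
bisq_weight <- function(z, k = 1.96) {
  ifelse(abs(z) <= k, (1 - (z / k)^2)^2, 0)
}

# Iteratively reweighted estimate of the scaling parameter mu:
# y = item-level estimates Y_i, v = their variances V[Y_i]
robust_mu <- function(y, v, k = 1.96, tol = 1e-8, max.iter = 100) {
  mu <- median(y)                        # robust starting value (assumed)
  for (iter in seq_len(max.iter)) {
    z <- (y - mu) / v                    # standardized residuals
    w <- bisq_weight(z, k)               # downweight items with large residuals
    mu.new <- sum(w * y) / sum(w)        # weighted-mean update
    if (abs(mu.new - mu) < tol) break
    mu <- mu.new
  }
  mu
}

# A DIF-free majority dominates: the outlying item gets weight zero
y <- c(0.50, 0.52, 0.48, 0.51, 3.00)     # last "item" has large DIF
robust_mu(y, v = rep(0.1, 5))            # close to 0.50, unlike mean(y)
```

Items whose standardized residuals exceed `k` in absolute value receive weight zero, so they drop out of the weighted mean entirely; this is what makes the estimate of impact robust to a minority of DIF items.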
The approach does not require specification of anchor items
Theoretical results guarantee that the procedure can tolerate up to 50% of items with DIF
Data simulation results show that it performs well compared to other methods
2PL in two groups, DIF in item difficulty only
DIF on item difficulties (intercepts) only (\(\Delta_i = .5\))
Impact on mean only (\(\mu_1 = .5\))
Focal factors
Design factors
The robustDIF package has been updated recently, so let's re-install it now
robustDIF takes item parameters from mirt, either as a multiple-group model or a list of two separate fits
library(mirt)
# Set up data for mirt
cint <- read.csv("cint_data.csv")
depression_names <- c("cint1", "cint2", "cint4", "cint11",
"cint27", "cint28", "cint29", "cint30")
depression_items <- cint[, depression_names]
gender <- factor(cint$cfemale)
# Estimate model (no invariance constraints)
config.mod <- multipleGroup(depression_items,
group = gender,
itemtype = "graded",
SE = T) # <- make sure to request SEs
# Print parms (in slope-intercept format)
coef(config.mod, IRTpars = F, simplify = T)
$`0`
$items
a1 d1 d2 d3
cint1 1.517 2.099 -0.442 -2.866
cint2 1.323 1.100 -0.704 -2.362
cint4 1.106 1.760 -0.025 -2.130
cint11 0.931 1.723 0.249 -2.313
cint27 1.647 0.799 -0.780 -2.607
cint28 0.909 1.194 -0.312 -2.141
cint29 1.074 -0.663 -1.908 -3.528
cint30 1.101 0.766 -0.525 -2.278
$means
F1
0
$cov
F1
F1 1
$`1`
$items
a1 d1 d2 d3
cint1 1.580 3.076 0.411 -2.226
cint2 1.242 1.566 -0.142 -1.961
cint4 1.122 2.480 0.431 -1.851
cint11 1.420 2.371 0.712 -1.689
cint27 1.748 1.190 -0.216 -2.205
cint28 1.443 1.941 0.303 -1.860
cint29 1.468 0.628 -0.627 -2.619
cint30 1.271 0.702 -0.501 -2.192
$means
F1
0
$cov
F1
F1 1
$group.1
a1 d1 d2 d3
item1 1.5167838 2.0990325 -0.44236088 -2.866174
item2 1.3226290 1.0996419 -0.70402161 -2.361671
item3 1.1060247 1.7599867 -0.02459558 -2.130449
item4 0.9305090 1.7228653 0.24874502 -2.313067
item5 1.6472558 0.7990658 -0.77984423 -2.606872
item6 0.9094216 1.1941310 -0.31173439 -2.141181
item7 1.0741429 -0.6629446 -1.90846725 -3.528486
item8 1.1011885 0.7658870 -0.52489886 -2.277890
$group.2
a1 d1 d2 d3
item1 1.580371 3.0757984 0.4107194 -2.226117
item2 1.242215 1.5663720 -0.1421765 -1.960686
item3 1.122109 2.4799671 0.4308424 -1.850987
item4 1.419502 2.3706138 0.7115116 -1.688975
item5 1.748094 1.1899711 -0.2163709 -2.205330
item6 1.442756 1.9410021 0.3027256 -1.859963
item7 1.467982 0.6282358 -0.6270320 -2.619477
item8 1.270718 0.7024522 -0.5005097 -2.192099
rdif() is the internal function for estimation
$est
[1] 0.404222
$weights
[1] 0.3839973 0.4828362 0.9999925 0.9835501 0.9494398 0.9304414 0.4264334
[8] 0.9999393 0.7666714 0.9440296 0.8000310 0.9806982 0.1649757 0.7915727
[15] 0.6049233 0.6791094 0.9837524 0.4172171 0.0000000 0.0000000 0.6549483
[22] 0.0000000 0.0000000 0.1326015
$n.iter
[1] 23
$epsilon
[1] 7.273842e-08
Use rdif_z_test() and rdif_chisq_test() for user-friendly output
          z.test p.val
cint1.d1 1.209 0.227
cint1.d2 1.083 0.279
cint1.d3 0.004 0.997
cint2.d1 -0.178 0.859
cint2.d2 0.314 0.754
cint2.d3 -0.369 0.712
cint4.d1 1.155 0.248
cint4.d2 0.011 0.991
cint4.d3 -0.691 0.489
cint11.d1 0.330 0.741
cint11.d2 -0.637 0.524
cint11.d3 0.193 0.847
cint27.d1 -1.510 0.131
cint27.d2 -0.651 0.515
cint27.d3 -0.924 0.356
cint28.d1 0.822 0.411
cint28.d2 0.177 0.859
cint28.d3 -1.166 0.244
cint29.d1 3.750 0.000
cint29.d2 2.987 0.003
cint29.d3 0.856 0.392
cint30.d1 -3.189 0.001
cint30.d2 -2.587 0.010
cint30.d3 -1.563 0.118
Use rdif_z_test() and rdif_chisq_test() for user-friendly output
          z.test p.val
cint1.a1 -0.023 0.982
cint2.a1 -0.684 0.494
cint4.a1 -0.180 0.857
cint11.a1 2.238 0.025
cint27.a1 0.102 0.919
cint28.a1 2.471 0.013
cint29.a1 1.558 0.119
cint30.a1 0.603 0.547
Use rdif_z_test() and rdif_chisq_test() for user-friendly output
       chi.square df p.val
cint1 2.269 4 0.686
cint2 2.347 4 0.672
cint4 2.798 4 0.592
cint11 12.688 4 0.013
cint27 3.360 4 0.499
cint28 11.099 4 0.025
cint29 30.338 4 0.000
cint30 11.697 4 0.020
cint29 and cint30 were found to have DIF
# DIF analysis for item slopes with mirt
strong.invariance <- c("free_mean", "free_var", "slopes", "intercepts")
strong.mod <- multipleGroup(depression_items,
group = gender,
itemtype = "graded",
invariance = strong.invariance,
verbose = F)
DIF(strong.mod,
which.par = c("a1"),
scheme = "drop")
groups converged AIC SABIC HQ BIC X2 df p
cint1 0,1 TRUE 0.013 1.571 1.828 4.747 1.987 1 0.159
cint2 0,1 TRUE -0.454 1.104 1.360 4.279 2.454 1 0.117
cint4 0,1 TRUE -0.826 0.731 0.988 3.907 2.826 1 0.093
cint11 0,1 TRUE -2.737 -1.179 -0.923 1.997 4.737 1 0.03
cint27 0,1 TRUE 0.318 1.876 2.132 5.051 1.682 1 0.195
cint28 0,1 TRUE -0.339 1.218 1.475 4.394 2.339 1 0.126
cint29 0,1 TRUE -9.060 -7.503 -7.246 -4.327 11.06 1 0.001
cint30 0,1 TRUE 1.246 2.803 3.060 5.979 0.754 1 0.385
So far we have focused on using a robust estimate of the scaling parameters (impact) to test for DIF in individual items
We can also compare the robust estimate to a “naive” estimate that would arise if we ignored DIF
If the two estimates give the “same” result, then DIF does not affect conclusions about impact
The logic of this test is similar to the Hausman specification test
Under the null hypothesis, both the robust estimate and the MLE are consistent (unbiased) estimates of the “true” impact
Under the alternative hypothesis, both may be inconsistent, but the robust estimate will be less biased
Consequently, the difference between the estimates can be used to test for the effect of DIF on impact
If all we want is to compare groups’ test scores:
Using the robust approach, we can test for DTF before having to do an item-by-item DIF analysis
If there is no DTF, we can proceed with group comparisons without doing a DIF analysis
If there is DTF, we can follow up with item-by-item analyses, test revisions, etc., before making comparisons
rdif.est ml.est delta se.delta z.test p.val
0.4042220 0.3955726 0.0086494 0.0396633 0.2180706 0.8273741
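The z statistic and p-value in this output can be reproduced by hand from the printed values:

```r
# Hausman-style comparison of the robust and ML estimates of impact,
# using the numbers printed in the delta_test() output above
rdif.est <- 0.4042220
ml.est   <- 0.3955726
se.delta <- 0.0396633              # SE of the difference, from the output

delta  <- rdif.est - ml.est        # 0.0086494
z.test <- delta / se.delta         # 0.218...
p.val  <- 2 * pnorm(-abs(z.test))  # two-sided p-value, approx. 0.827
```

The large p-value indicates no detectable difference between the two estimates of impact, i.e., no evidence that DIF affects the group comparison here.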
In test development, we almost always want to know about DIF at the item level
In research settings, sometimes we just care about whether comparisons between groups are biased or not
Using robust scaling, we can make inferences about DTF before doing an item-by-item DIF analysis
Unlike tests of MI, we are not testing whether all items are DIF-free
If we conclude there is no DTF, there may or may not be DIF
rdif_z_test and rdif_chisq_test for item-level DIF
delta_test for test-level DTF
The robustDIF package is in the early stages of development
peter.halpin@unc.edu
Bechger, T. M., & Maris, G. (2015). A Statistical Test for Differential Item Pair Functioning. Psychometrika, 80, 317–340. https://doi.org/10.1007/s11336-014-9408-y
Belzak, W. C. M., & Bauer, D. J. (2020). Improving the assessment of measurement invariance: Using regularization to select anchor items and identify differential item functioning. Psychological Methods, 25(6), 673–690. https://doi.org/10.1037/met0000253
He, Y., & Cui, Z. (2020). Evaluating Robust Scale Transformation Methods With Multiple Outlying Common Items Under IRT True Score Equating. Applied Psychological Measurement, 44(4), 296–310. https://doi.org/10.1177/0146621619886050
He, Y., Cui, Z., & Osterlind, S. J. (2015). New Robust Scale Transformation Methods in the Presence of Outlying Common Items. Applied Psychological Measurement, 39(8), 613–626. https://doi.org/10.1177/0146621615587003
Huber, P. J., & Ronchetti, E. (2009). Robust statistics (2nd ed). Wiley.
Magis, D., Tuerlinckx, F., & De Boeck, P. (2015). Detection of Differential Item Functioning Using the Lasso Approach. Journal of Educational and Behavioral Statistics, 40(2), 111–135. https://doi.org/10.3102/1076998614559747
Schauberger, G., & Mair, P. (2020). A regularization approach for the detection of differential item functioning in generalized partial credit models. Behavior Research Methods, 52(1), 279–294. https://doi.org/10.3758/s13428-019-01224-2
Stocking, M. L., & Lord, F. M. (1983). Developing a Common Metric in Item Response Theory. Applied Psychological Measurement, 7(2), 201–210. https://doi.org/10.1177/014662168300700208
Teresi, J. A., Wang, C., Kleinman, M., Jones, R. N., & Weiss, D. J. (2021). Differential Item Functioning Analyses of the Patient-Reported Outcomes Measurement Information System (PROMIS) Measures: Methods, Challenges, Advances, and Future Directions. Psychometrika, 86(3), 674–711. https://doi.org/10.1007/s11336-021-09775-0
Wang, W., Liu, Y., & Liu, H. (2022). Testing Differential Item Functioning Without Predefined Anchor Items Using Robust Regression. Journal of Educational and Behavioral Statistics, 47(6), 666–692. https://doi.org/10.3102/10769986221109208
Yuan, K.-H., Liu, H., & Han, Y. (2021). Differential Item Functioning Analysis Without A Priori Information on Anchor Items: QQ Plots and Graphical Test. Psychometrika, 86(2), 345–377. https://doi.org/10.1007/s11336-021-09746-5
Specify 2PL IRT model in “slope-intercept” form
Group 0 (reference)
\[\text{logit}(p_{0i}) = a_{0i} \theta_{0} + d_{0i} \; \text{ with } \; \theta_0 \sim N(0, 1)\]
Group 1 (focal)
\[\text{logit}(p_{1i}) = a_{1i} \theta_{1} + d_{1i} \; \text{ with } \; \theta_1 = (\theta^*_1 - \mu) / \sigma \; \text{ and } \; \theta^*_1 \sim N(\mu, \sigma^2)\]
\[a_{1i} \theta_{1} + d_{1i} = a_{1i}^* \theta_{1}^* + d_{1i}^* \; \text{ where } \; a_{1i}^* = \frac{a_{1i}}{\sigma} \; \text{ and } \; d_{1i}^* = d_{1i} - \frac{a_{1i} \mu}{\sigma}\]
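The reparameterization can be checked numerically. Substituting \(\theta_1 = (\theta^*_1 - \mu)/\sigma\) into the group-1 model and collecting terms gives \(a^*_{1i} = a_{1i}/\sigma\) and \(d^*_{1i} = d_{1i} - a_{1i}\mu/\sigma\); the parameter values below are arbitrary, chosen only for illustration:

```r
# Check that the two parameterizations give the same logit for any theta
a1 <- 1.2; d1 <- 0.3            # slope and intercept on the standardized scale
mu <- 0.5; sigma <- 1.3         # impact parameters (assumed values)
theta.star <- 0.8               # latent trait on the unstandardized scale
theta <- (theta.star - mu) / sigma

a1.star <- a1 / sigma                # rescaled slope
d1.star <- d1 - a1 * mu / sigma      # rescaled intercept

all.equal(a1 * theta + d1, a1.star * theta.star + d1.star)  # TRUE
```

This identity is what links DIF in the item parameters to the scaling (impact) parameters \(\mu\) and \(\sigma\).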
library(lavaan)
# Model (same as above)
mod1 <- ' depression =~ cint1 + cint2 + cint4 + cint11 +
cint27 + cint28 + cint29 + cint30'
# Fit model
fit.config <- cfa(mod1,
data = cint,
std.lv = T,
ordered = T,
group = "cfemale") # <--- new
# extract parms
lavaan.parms <- get_model_parms(fit.config)
# RDIF procedures (groups are reversed)
delta_test(lavaan.parms)
rdif_z_test(lavaan.parms, par = "intercept")
rdif_z_test(lavaan.parms, par = "slope")
rdif_chisq_test(lavaan.parms)