Website: peterhalpin.github.io/RDIF-workshop/
Slides: These slides in HTML format
Notes: These slides in DOCX format (translated, editable)
Code: Just the code from these slides
Psychometric models posit two non-mutually exclusive explanations of why the distribution of test scores may differ over groups of respondents
Impact: the groups differ on the trait being measured
DIF: the measure is biased with respect to group membership
The goal of DIF analysis is to detect biased items without making assumptions about impact
semTools::partialInvariance()
In IRT, usually the letter “\(\theta\)” denotes the latent trait (ability)
Like factor analysis, we usually assume \(\theta \sim N(0, 1)\) when considering a single group or population
It is customary to present IRT models in terms of the measurement model for each item: \(p(X_j \mid \theta)\)
For binary data:
\[ P_j(\theta) = \text{Prob}(X_j = 1 \mid \theta)\]
This is called the item response function (IRF)
It describes how the probability of endorsing an item depends on the level of the trait being measured
\[\begin{equation} P_j(\theta) = \frac{\exp(a_j (\theta - b_j))} {1 + \exp(a_j (\theta - b_j))} \end{equation}\]
The parameter \(b_j\) is called the item difficulty
Note that \(\theta = b_j\) implies
\[ P_j(b_j) = \frac{\exp(a_j (0))} {1 + \exp(a_j (0))} = 1/2 \]
So, item difficulty is the value of \(\theta\) at which the probability of endorsing the item is equal to 1/2
Respondents with ability above the difficulty level of the item have probability > 1/2 of answering the item correctly, and conversely
The parameter \(a_j > 0\) is called the item discrimination
Items with higher discrimination have steeper slopes and a stronger relationship to the latent trait:
\[ \frac{\partial}{\partial \theta} P_j(\theta) = a_j P_j(\theta) Q_j(\theta), \quad \text{where } Q_j(\theta) = 1 - P_j(\theta) \]
\[ \frac{\partial}{\partial \theta} P_j(b_j) = a_j /4 \]
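A quick numerical check (a minimal sketch in base R with assumed parameter values, not code from the workshop) confirms that \(P_j(b_j) = 1/2\) and that the slope at \(b_j\) is \(a_j/4\):

# Sketch: evaluate the 2PL IRF for one hypothetical item
a <- 1.5                                         # assumed discrimination
b <- 0.5                                         # assumed difficulty
irf <- function(theta) plogis(a * (theta - b))   # P_j(theta) for the 2PL
irf(b)                                           # 0.5 at theta = b
a * irf(b) * (1 - irf(b))                        # slope at theta = b, equals a/4 = 0.375
curve(irf, from = -4, to = 4, xlab = expression(theta), ylab = "P(X = 1)")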
Let \(\hat \theta\) denote the maximum likelihood estimate (MLE) of \(\theta\)
In practical terms, \(\hat \theta\) is the “best” estimate of the trait we can obtain from an assessment
The standard error of the MLE, \(SE(\hat\theta)\), describes how precisely we can estimate \(\theta\)
Information is defined as the precision of the MLE, \(1/SE(\hat\theta)^2\)
One of the main contributions of IRT is to model how information depends on the parameters of test items
This provides a good theory for test development! It tells us how to build tests with a desired level of precision / information
We address information here for completeness but it is not required for DIF analysis
The item information function (IIF) is the precision that results from estimating the latent trait using a single item
In practice, we would never use only a single item on a test
But, we can build up the information function of the entire test from that of each individual item
So, we start with the IIF and then use that to get the test information function (TIF)
For the 2PL, the IIF is:
\[\begin{equation} I_j(\theta) = a_j^2 P_j(\theta) Q_j(\theta) \end{equation}\]
The TIF is obtained by summing the information functions of all of the items on a test
\[ I(\theta) = \sum_{j = 1}^{J} I_j(\theta) \]
Reliability at a given level of \(\theta\) can be obtained from the TIF (assuming the latent trait has unit variance):
\[R(\theta) = \frac{I(\theta)}{1 + I(\theta)}\]
Marginal reliability of the total score:
[1] 0.8148199
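As a sketch, quantities like these can be computed from a fitted mirt model with the helpers testinfo() and marginal_rxx(); the grm object fitted later in these slides is used here for illustration, and the model behind the figure above is not specified, so the value may differ:

# Sketch: test information, conditional reliability, and marginal reliability in mirt
theta <- seq(-4, 4, length.out = 201)    # grid of trait values
tif <- testinfo(grm, matrix(theta))      # test information I(theta)
rel <- tif / (1 + tif)                   # conditional reliability R(theta)
plot(theta, rel, type = "l", xlab = expression(theta), ylab = "Reliability")
marginal_rxx(grm)                        # marginal reliability of the trait estimate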
A central concept in IRT is (Fisher) information, which is the precision with which the target construct can be estimated
Item information functions describe the information provided by each item
Test information is the sum of the items’ information
Reliability of the total score can be derived from the test information
Information is useful for comparing different items / tests but reliability is easier to interpret
Let the item response \(X_j\) take on values \(c \in \{0, 1, \dots, C-1\}\), where \(C\) is the number of response categories for the item
It doesn’t matter what we label the categories as long as they are ordered
For CINT: never (0), rarely (1), sometimes (2), or almost always (3)
The cumulative response function is the probability of endorsing category \(c\) or higher, conditional on \(\theta\) (defined for \(c = 1, \dots, C-1\))
\[ P_{jc}(\theta) = \text{Prob} (X_j \geq c \mid \theta) \]
\[\text{Prob} (X_j = c \mid \theta) = P_{jc}(\theta) - P_{j, c+1}(\theta), \quad \text{with } P_{j0}(\theta) = 1 \text{ and } P_{jC}(\theta) = 0\]
\[ P_{jc}(\theta) = \frac{\exp(a_j (\theta - b_{jc}))} {1 + \exp(a_j (\theta - b_{jc}))} \]
Each item has only one discrimination (proportional odds assumption)
Each response category has its own difficulty parameter, now called a “threshold” parameter
\[\begin{align} \text{Prob} (X_j = 0 \mid \theta) & = 1 - P_{j1}(\theta) \\ \text{Prob} (X_j = 1 \mid \theta) & = P_{j1}(\theta) - P_{j2}(\theta) \\ ... \\ \text{Prob} (X_j = C-1 \mid \theta) & = P_{j,C-1}(\theta) - 0 \end{align}\]
These give the probability of endorsing each category
Note that this reduces to 2PL for \(C = 2\)
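A minimal sketch in base R (parameter values assumed, loosely based on the cint1 row of the coefficient table shown later) of how the category probabilities and the expected item score follow from the cumulative response functions:

# Sketch: GRM category probabilities for one hypothetical item with C = 4 categories
a <- 1.6                                   # assumed discrimination
b <- c(-1.6, 0.0, 1.6)                     # assumed thresholds b_1, b_2, b_3
theta <- 1                                 # evaluate at a single trait value
cum <- c(1, plogis(a * (theta - b)), 0)    # P_j0 = 1, then P_j1, P_j2, P_j3, then P_j4 = 0
probs <- cum[1:4] - cum[2:5]               # Prob(X = 0), ..., Prob(X = 3)
sum(probs)                                 # probabilities sum to 1
sum((0:3) * probs)                         # expected item score at this theta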
library(mirt)

# Load data and separate depression items
cint <- read.csv("cint_data.csv")
depression_names <- c("cint1", "cint2", "cint4", "cint11",
                      "cint27", "cint28", "cint29", "cint30")
depression_items <- cint[, depression_names]

# Run GRM model
grm <- mirt(depression_items,
            itemtype = "graded")

# Per-item plots
itemplot(grm,
         item = 1,
         type = "threshold",
         main = "Cumulative response functions")
itemplot(grm,
         item = 1,
         type = "trace",
         main = "Category response functions")
# Plotting all test items
plot(grm,
     type = "itemscore",
     main = "Expected item score functions",
     facet_items = FALSE)
Plots for CINT.1 (“Feels sad or depressed”)
The cumulative response functions (top) are not usually reported; they are shown here to illustrate the 2PL assumption
The ICRFs (bottom) are usually reported; they show the probability of endorsing each category
Values of item thresholds are shown by dashed vertical lines
Plotting the expected item score is a way of simplifying presentation of the entire assessment
It shows how the expected response (0-3) depends on the measured trait
Computed as \(\sum_c c\times \text{Prob} (X_j = c \mid \theta)\)
Curves further to the left correspond to items that were “easier”, in the sense that higher item scores are expected at lower levels of depression
$items
a b1 b2 b3
cint1 1.612 -1.622 -0.011 1.578
cint2 1.286 -1.049 0.306 1.665
cint4 1.129 -1.879 -0.195 1.758
cint11 1.219 -1.688 -0.402 1.608
cint27 1.692 -0.594 0.279 1.404
cint28 1.213 -1.292 -0.001 1.641
cint29 1.351 -0.004 0.889 2.217
cint30 1.155 -0.630 0.435 1.910
$means
F1
0
$cov
F1
F1 1
Usually, plots rather than coefficient tables are presented to summarize the models
IIFs, TIFs, and reliability are provided in the Appendix
GRM is widely used for ordered categorical responses
The cumulative response functions are modeled using 2PL
The ICRFs are derived from the cumulative response functions
The goal of DIF analysis is to detect biased items without making assumptions about impact
So, we want to test whether item parameters differ over groups
There are lots of ways to do this, but for the example we will focus on the likelihood ratio test for nested models
If an item exhibits DIF, it should be investigated (e.g., revised or omitted)
Sounds simple enough, but…
The more obvious problem: inferring whether item parameters differ as a function of some external variable(s)
For illustrative purposes, consider Lord’s (Wald) test for the difficulty parameter of the 2PL in groups \(g = 0, 1\)
\[ z_j = \frac{\hat b_{1j} - \hat b_{0j}} {\sqrt{\text{var}(\hat b_{1j}) + \text{var}(\hat b_{0j})}}\]
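A minimal sketch of the computation, using made-up estimates and standard errors (assumptions for illustration, not results from the CINT data):

# Sketch: Lord's z statistic for one item's difficulty, with hypothetical estimates
b0 <- 0.40; se_b0 <- 0.10    # assumed estimate and SE in group 0
b1 <- 0.75; se_b1 <- 0.12    # assumed estimate and SE in group 1
z <- (b1 - b0) / sqrt(se_b0^2 + se_b1^2)
2 * pnorm(-abs(z))           # two-sided p-value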
The less obvious problem: IRT models are identified only up to a linear transformation of the latent trait
This means that the item parameters and latent trait can be linearly transformed without changing the IRFs
Let \(\theta^* = A\theta + B\), \(b^*_j = A b_j + B\), and \(a^*_j = a_j/A\):
\[\begin{align} \text{logit} (P_j(\theta)) & = a^*_j(\theta^* - b^*_j) \\ & = \frac{a_j}{A}(A\theta + B - (A b_j + B)) \\ & = a_j(\theta - b_j) \end{align}\]
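This can be verified numerically (a minimal sketch with arbitrary assumed values; not code from the workshop):

# Sketch: the rescaled parameters produce exactly the same IRF
a <- 1.5; b <- 0.5; theta <- seq(-3, 3, by = 0.5)    # assumed 2PL parameters and trait values
A <- 2; B <- -1                                      # arbitrary linear transformation
logit_orig  <- a * (theta - b)
logit_trans <- (a / A) * ((A * theta + B) - (A * b + B))
all.equal(logit_orig, logit_trans)                   # TRUE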
This is the technical reason that we need to set the scale of the latent trait when estimating psychometric models
Setting \(\theta \sim N(0, 1)\) implies that \(A = 1\) and \(B = 0\), which solves the problem
If we scale the latent trait to have mean \(\mu_g\) and standard deviation \(\sigma_g\) in group \(g\), this has implications for testing model parameters
The scale transformations are
\[\begin{equation} \notag \theta^*_1 = \sigma_0 \left(\frac{\theta_1 - \mu_1}{\sigma_1}\right) + \mu_0 \quad \text{and} \quad b^*_{1i} = \sigma_0 \left(\frac{b_{1i} - \mu_1}{\sigma_1}\right) + \mu_0 \end{equation}\]
Plugging the rescaled item parameters into Lord’s test
\[\begin{align} \label{dstar}\notag z^*_i = \frac{\hat b^*_{1i} - \hat b_{0i}} {\sqrt{\text{var}(\hat b^*_{1i}) + \text{var}(\hat b_{0i})}} = \frac{\frac{\sigma_0}{\sigma_1} \left(\hat b_{1i} - \mu_1 \right) + \mu_0 - \hat b_{0i}} {\sqrt{\frac{\sigma^2_0}{\sigma^2_1} \text{var}(\hat b_{1i}) + \text{var}(\hat b_{0i})}} \end{align}\]
Conclusion: if there is impact on either the mean or the variance of the latent trait, Lord’s test is biased
The problem just described has been referred to as the circular nature of DIF (Angoff, 1982)
We want to compare the value of model parameters over groups
To do this we must scale the latent trait in both groups
To scale the latent trait, we must assume some model parameters are equal over groups
But this is what we wanted to test in the first place!
In practice, the problem is resolved by choosing an “anchor set” of items
Anchors are items that we treat as DIF-free when testing other items for DIF
There are many strategies, heuristics, etc. for choosing anchors
These are all flawed – anchor item selection is a limitation of traditional methods for DIF analysis
One widely used approach to choosing anchors is called purification and refinement
Stage 1. Test each item assuming every other item is an anchor. Before starting stage two, remove any item with DIF from the anchor set (“purification”)
Stage 2. Test the items without DIF again, using the purified anchor set
Can repeat as desired
The procedure is exploratory: it involves many tests of DIF, fitting models under different restrictions, …
The LR test uses a multi-group IRT model to test whether the parameters of an item differ over groups
Write the 2PL in two groups as follows:
\[\begin{align} \text{Reference group: } & \text{logit} (P_{j0}(\theta)) = a_{j0}(\theta - b_{j0}) \quad \text{ with } \theta \sim N(0, 1) \\ \text{Comparison group: } & \text{logit} (P_{j1}(\theta)) = a_{j1}(\theta - b_{j1}) \quad \text{ with } \theta \sim N(\mu, \sigma) \\ \end{align}\]
The second subscript on the IRFs and item parameters indicates the reference group (0) or the comparison group (1)
In the reference group, we scale the latent trait arbitrarily
In the comparison group, we estimate the mean and the variance of the latent trait
In order to apply the LR test, we estimate the following two models
Model 1: The nested (smaller) model is obtained by setting all item parameters equal across groups
\[a_{j0} = a_{j1} = a_{j} \quad \text{ and } \quad b_{j0} = b_{j1} = b_j \quad \text{for all } \quad j = 1 \dots J\]
Same as strong invariance in factor analysis
Model 2: The nesting (larger) model is obtained by allowing the parameters of the focal item to vary across groups:
\[a_{j0} \neq a_{j1} \quad \text{ and } \quad b_{j0} \neq b_{j1} \quad \text{for the focal item } j^* \]
Note that we are not requiring the parameters to be unequal; they may be equal or unequal, and we simply allow them to be estimated freely in each group
Software automates the fitting of these item-by-item models
The LR test then proceeds by comparing the likelihood of the nested model to that of the nesting model
When the constraints imposed by the nested model are valid (i.e., if there is no DIF on the item), the test statistic has a chi-square distribution with degrees of freedom equal to the number of constrained parameters
If the LR test of DIF is significant, we conclude that the item is biased
If not, then we conclude that the item is not biased
# Groups need to be a factor
gender <- factor(cint$cfemale)

# Invariance constraints used by mirt
strong.invariance <- c("free_mean", "free_var", "slopes", "intercepts")

# Estimate model (can request SEs using SE = TRUE)
strong.mod <- multipleGroup(depression_items,
                            group = gender,
                            itemtype = "graded",
                            invariance = strong.invariance)

# View output
coef(strong.mod, IRTpars = TRUE, simplify = TRUE)

DIF(strong.mod,
    which.par = c("a1", "d1", "d2", "d3"), # <- mirt notation
    scheme = "drop")                       # <- drop item constraints
groups converged AIC SABIC HQ BIC X2 df p
cint1 0,1 TRUE 3.113 9.344 10.370 22.047 4.887 4 0.299
cint2 0,1 TRUE 4.692 10.923 11.949 23.626 3.308 4 0.508
cint4 0,1 TRUE 3.169 9.400 10.425 22.102 4.831 4 0.305
cint11 0,1 TRUE 1.948 8.178 9.204 20.881 6.052 4 0.195
cint27 0,1 TRUE 0.629 6.860 7.886 19.563 7.371 4 0.118
cint28 0,1 TRUE 3.090 9.321 10.347 22.024 4.91 4 0.297
cint29 0,1 TRUE -23.411 -17.180 -16.154 -4.477 31.411 4 0
cint30 0,1 TRUE -9.195 -2.964 -1.939 9.738 17.195 4 0.002
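As a sketch of what DIF() automates for each row (illustrated for cint29; the anchor_items object is introduced here only for this example), the same comparison can be run by hand with a second multipleGroup() fit and anova():

# Sketch: the LR test for cint29 "by hand"; should approximately reproduce the cint29 row above
anchor_items <- setdiff(depression_names, "cint29")   # treat the other items as anchors
free.mod <- multipleGroup(depression_items,
                          group = gender,
                          itemtype = "graded",
                          invariance = c("free_mean", "free_var", anchor_items))
anova(strong.mod, free.mod)   # LR test, df = number of parameters freed for cint29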
DIF(strong.mod,
    which.par = c("a1", "d1", "d2", "d3"),
    scheme = "drop_sequential", # <- different scheme
    seq_stat = .05,             # <- Type I error rate for DIF
    max_run = 2)                # <- two stages only
Checking for DIF in 6 more items
Computing final DIF estimates...
groups converged AIC SABIC HQ BIC X2 df p
cint29 0,1 TRUE -18.863 -12.632 -11.606 0.071 26.863 4 0
cint30 0,1 TRUE -4.647 1.584 2.610 14.286 12.647 4 0.013
DIF analysis identified two items that were biased with respect to gender
More questions: In which direction is the DIF? Does allowing for DIF change conclusions about impact (gender differences in depression)?
One way to investigate these questions:
# Invariance constraints (anchor items treated as DIF-free)
partial.invariance <- c("free_mean", "free_var",
                        "cint1", "cint2", "cint4", "cint27", "cint28")

# Estimate model
partial.mod <- multipleGroup(depression_items,
                             group = gender,
                             itemtype = "graded",
                             invariance = partial.invariance)

# Plot expected item score functions of the biased items
itemplot(partial.mod, type = "score", item = "cint29", main = "CINT 29")
itemplot(partial.mod, type = "score", item = "cint30", main = "CINT 30")

# Examine parameter estimates
coef(partial.mod, IRTpars = TRUE, simplify = TRUE)
DIF analysis identified two items that were biased with respect to gender
Females were expected to report higher scores than males, even at the same level of depression
Gender differences in depression changed when the items with DIF were allowed to vary over groups (partial invariance); a comparison of the estimated group parameters is sketched below
A limitation of current methods is that we cannot directly test whether DIF affects conclusions about impact
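A minimal sketch of how this could be inspected informally with the models fitted above, by comparing the estimated latent means and variances under full versus partial invariance:

# Sketch: does freeing the DIF items change the estimated impact?
coef(strong.mod, simplify = TRUE)    # all items constrained; see $means and $cov per group
coef(partial.mod, simplify = TRUE)   # cint29 and cint30 freed; compare the group means/variances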
We have seen how to test for DIF using IRT models (the 2PL and GRM)
In our example, we found two items biased with respect to gender
We discussed two limitations of DIF analysis
New procedures for addressing DIF and DTF
Do not require selection of anchor items
Guaranteed to work if < 50% of items exhibit DIF
Can be used to test for whether DIF affects impact (without having to first test for DIF in each item!)
Easy to implement
Angoff, W. (1982). Use of difficulty and discrimination indices for detecting item bias. In R. Berk (Ed.), Handbook of Methods for Detecting Test Bias (pp. 96–116). The Johns Hopkins Press.
Nicewander, W. A. (2018). Conditional reliability coefficients for test scores. Psychological Methods, 23(2), 351–362. https://doi.org/10.1037/met0000132
Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential Item Functioning (pp. 67–113). Lawrence Erlbaum Associates.
item S_X2 df.S_X2 RMSEA.S_X2 p.S_X2
1 cint1 37.080 39 0.000 0.558
2 cint2 51.364 44 0.014 0.207
3 cint4 40.499 43 0.000 0.580
4 cint11 59.168 43 0.021 0.051
5 cint27 81.114 40 0.035 0.000
6 cint28 41.579 43 0.000 0.533
7 cint29 55.568 43 0.019 0.095
8 cint30 39.844 46 0.000 0.727