Skip to contents

These notes outline an updated version of the robust DIF procedure described in Halpin (2022) that is now implemented by the robustDIF package.

Assume two groups of respondents with sample sizes n1n_1 and n2n_2 and let n=n1+n2n = n_1 + n_2. Also let Y=(Y1,,Ym)Y = (Y_1,\dots,Y_m)^\top denote item-level statistics derived from the parameter estimates of items i=1,mi = 1, \dots m. The asymptotic arguments presented below assume that n1n_1 and n2n_2 go to infinity at the same rate but that the number of items mm remains fixed and finite.

The YiY_i are chosen so that n(𝐘θ0𝟏)dN(0,Σ0)\sqrt{n}(\boldsymbol{Y} - \theta_0 \mathbf 1)\overset{d}\rightarrow N(0, \Sigma_0) under the null hypothesis of no DIF on any item. In this set-up, DIF means that YiY_i converges in probability to a fixed value other than θ0\theta_0. Some specific choices of YiY_i are detailed in Halpin (2024, 2025) and Halpin & Gilbert (2025). The notation V0(𝐘)=Σ0/nV_0(\boldsymbol{Y}) = \Sigma_0/n is used to denote the finite sample variance of 𝐘\boldsymbol{Y}, and similarly for other statistics.

The robust DIF procedure can be seen as solving two interrelated problems. First, it provides an M-estimator of θ0\theta_0 that is highly robust to DIF. Second, it provides a procedure for flagging items with DIF, which happens automatically as a by-product of estimating θ0\theta_0. Standard Wald tests of DIF are also available.

Halpin (2022) used the (unstated) assumption that Σ0\Sigma_0 is diagonal and attempted to combine efficiency and robustness in a way that did not clearly separate these two opposing considerations. These notes address these shortcomings and implement a new, simpler and more general, version of the robust DIF procedure. The analytical details are stated so that they apply to any bounded, redescending loss function, although for computation the focus is on Tukey’s bisquare. Differences with Halpin (2022) are pointed out as they arise.

The Updated Robust DIF Procedure

Defining the Estimator

The robust estimator θ̃\tilde\theta can be defined in three related ways. Let ui=ui(θ)=(Yiθ)/siu_i = u_i(\theta) = (Y_i - \theta)/{s_i} where si>0s_i>0 are item-specific scaling factors to be chosen subsequently. The three definitions of θ̃\tilde\theta are as follows.

  1. The minimizing argument of the loss function: R(θ)=i=1mρ(ui(θ)).R(\theta) = \sum_{i=1}^m \rho(u_i(\theta)). This definition is useful for deriving results about the robustness of θ̃\tilde\theta. For redescending loss functions there exists constants cc and kk such that ρ(ui)=c\rho(u_i) = c whenever |ui|k|u_i| \geq k. It is usual to scale ρ\rho so that c=1c = 1. The constant kk is treated as a tuning parameter that serves to identify outliers (i.e., items with DIF).

  2. The solution to the estimating equation: Ψ(θ)=i=1mψ(ui(θ))/si=0, \Psi(\theta) = \sum_{i=1}^m \psi(u_i(\theta))/s_i = 0, where ψ(u)=ρ(u)\psi(u) = \rho'(u). The influence function Ψ(θ)\Psi(\theta) is important for obtaining the variance of θ̃\tilde\theta.

  3. A weighted mean that is obtained by defining the weights wi*(θ)=ψ(ui(θ))/ui(θ)w^*_i(\theta) = \psi(u_i(\theta)) / u_i(\theta) and substituting these into the estimating equation to get: θ=i=1mwi(θ)Yiwithwi(θ)=wi*(θ)/si2j=1mwj*(θ)/sj2. \theta = \sum_{i = 1}^{m} w_i(\theta)\, Y_i \qquad \text{with} \qquad w_i(\theta) = \frac{w^*_i(\theta) / s_i^2}{\sum_{j=1}^{m} w^*_j(\theta) / s_j^2}. By convention, wi*(0)1w^*_i(0) \equiv 1 to avoid division by zero. Also note that, when |ui|k|u_i| \geq k, ψ(ui)=wi*(θ)=0\psi(u_i) = w^*_i(\theta) = 0, so that outliers (as defined by kk) are “redescended” to zero. The weighted mean is useful for computation via iteratively re-weighted least squares (IRLS). Especially for redescending loss functions, IRLS can be much more stable than Newton-based methods that solve Ψ(θ)=0\Psi(\theta) = 0 (because Ψ(θ)\Psi'(\theta) can approach zero, leading the Newton steps to diverge).

Choosing the Scaling Factors sis_i

The scaling factors sis_i are required to ensure that θ̃\tilde\theta is equivariant under re-scaling of the YiY_i. In conventional applications, the YiY_i are “raw” data points and the scaling factors si=ss_i = s are chosen to be an ancillary estimate of the scale of the YiY_i (e.g., the median absolute deviation or MAD). In this situation, the scaling factors are constant over ii, so they factor out of the estimating equation and cancel out in the numerator and denominator of the normalized weights wiw_i.

In the present application, item-specific scaling factors sis_i are available because we can derive Σ0\Sigma_0 (the asymptotic covariance matrix of the YiY_i under the null hypothesis of no DIF) by applying the Delta method to the item parameter estimates. As shown above, this somewhat complicates the relationship among the different definitions of θ̃\tilde\theta, because it is now important to keep track of how the item-specific scaling factors appear in the different definitions. However, the item-specific scaling factors are worth this additional complication for the following three reasons, all of which were noted in Halpin (2022).

  • First, obtaining item-specific scaling factors sis_i analytically from Σ0\Sigma_0 means that we no longer require an ancillary estimate based on the scale of the realized values of YiY_i. This is important because it leads the resulting estimator to be highly robust to DIF. This is also the main detail that separates the proposed approach from that considered (and dismissed) by Stocking and Lord (1984). Stocking and Lord used s=MAD(Yi)s = MAD(Y_i). However, the MAD has a breakdown point of 1/4, so any estimator of θ\theta that uses s=MAD(Yi)s = MAD(Y_i) will breakdown if 1/4\geq 1/4 of the items exhibit DIF (see Huber and Roncetti, 2009, chapter 6). By contrast Σ0\Sigma_0 does not depend directly on the scale of the realized values of YiY_i – it can be computed directly from the item parameters. The overall result is that the robustness of the resulting M-estimator no longer depends on that of an ancillary scaling factor.

  • The value of YiY_i appears in the expression for V0(Yi)V_0(Y_i). Thus, V̂0(Yi)\widehat{V}_0(Y_i) may yet be contaminated by DIF, leading to the potential for “masking”. Although this problem is not as severe as when using an ancillary estimate like the MAD, it is still a potential concern. This problem can be avoided as follows. The null hypothesis that the item does not exhibit DIF gives Yipθ0Y_i \overset{p}\rightarrow\theta_0. This motivates using the substitution Yi=θY_i = \theta^\star when estimating Σ0\Sigma_0, where θ\theta^\star is a consistent, high-breakdown estimate of θ\theta (e.g., the median). The overall result is a plug-in estimator of V0(Yi){V}_0(Y_i) that is robust to DIF.

  • Third, using item-specific scaling factors implies that we can downweight items with DIF at the desired asymptotic false positive rate during IRLS-based estimation. For example, if we choose si=V0(Yi)s_i = \sqrt{V_0(Y_i)} and kk as the 1α/21-\alpha/2 quantile of N(0,1)N(0, 1), then items are down-weighted to zero once |Yi||Y_i| lies beyond (1α)×100(1-\alpha) \times 100% confidence interval (CI) centered at θ\theta. In this way, DIF detection arises as a by-product of robust scaling.

Halpin (2022) chose si=V0(Yi)s_i = V_0(Y_i) based on a (flawed) argument about efficiency in the absence of DIF, which also complicated the choice of the tuning parameter kk. The argument was flawed because (a) it was based on the unstated assumption that Σ0\Sigma_0 is diagonal, which is not true for many IRT estimators, and (b) it did not account for the scaling factors that appear outside the influence function in the estimating equation (see point 2 in the previous section).

To address these issues, the robust DIF estimator has been updated to use si=V0(Yi)s_i = \sqrt{V_0(Y_i)} with tuning parameter kk based on the asymptotic CI rationale outlined above. This approach ignores sampling variation in θ̃\tilde\theta when downweighting items with DIF. A more accurate downweighting procedure could instead be based on si=V0(Yiθ̃)s_i = \sqrt{V_0(Y_i - \tilde\theta)}. Since YiY_i and θ̃\tilde\theta are positively correlated, V0(Yiθ̃)V0(Yi)V_0(Y_i - \tilde\theta) \leq V_0(Y_i). Thus, the flagging procedure based on V0(Yi)V_0(Y_i) is somewhat anti-conservative. To address this issue, it is recommended to compute item-by-item tests of DIF using a standard Wald test following estimation of θ̃\tilde\theta:

zi=Yiθ̃V0(Yiθ̃). z_i = \frac{Y_i - \tilde\theta}{\sqrt{V_0(Y_i - \tilde\theta)}}.

Note that modifying the estimator θ̃\tilde\theta to instead use si=V0(Yiθ̃)s_i = \sqrt{V_0(Y_i - \tilde\theta)} is possible (see Halpin), but this leads to complications obtaining its asymptotic distribution. In practice, these complications do not appear to be worth the trouble as there is little change in finite sample performance when using the simpler approach outlined above.

The Asymptotic Distribution of θ̃\tilde\theta

Halpin (2022) obtained the asymptotic distribution of θ̃\tilde\theta using the Delta method and the implicit function theorem. The derivation is recapitulated here for general (i.e., non-diagonal) Σ0\Sigma_0, and the results presented in Halpin (2022) are seen to follow from the assumption that Σ0\Sigma_0 is diagonal.

The estimator θ̃=g(𝐘)\tilde\theta = g(\boldsymbol{Y}) is implicitly defined as the solution to the estimating equation

Ψ(θ;𝐘)=i=1mψ((Yiθ)/si)si=0. \Psi(\theta; \boldsymbol{Y}) = \sum_{i=1}^m \frac{\psi\left({(Y_i-\theta})/{s_i}\right)}{s_i} = 0.

Let the asymptotic distribution of 𝐘\boldsymbol{Y} be denoted as

n(𝐘𝛍)pN(0,Σ)\sqrt{n}(\boldsymbol{Y}-\boldsymbol{\mu}) \overset{p}{\rightarrow} N(0,\Sigma) with the null hypothesis of no DIF leading to 𝛍=θ0𝟏\boldsymbol{\mu} = \theta_0\boldsymbol{1} and Σ=Σ0\Sigma = \Sigma_0. Also let θ\theta^\star be defined as any solution to the population estimating equation E𝐘[Ψ(θ;𝐘)]=0E_\boldsymbol{Y}[\Psi(\theta; \boldsymbol{Y})]= 0. There may be multiple local solutions when using a redescending loss function, and the asymptotic results described here apply to any local solution. In practice, local minima can be diagnosed by plotting R(θ)R(\theta) over a grid of θ\theta values.

The following assumptions are used:

  • A1: ψ(u)\psi(u) continuously differentiable.
  • A2: ψ(u)\psi(u) is odd (i.e. ψ(u)=ψ(u)\psi(-u) = -\psi(u)).
  • A3: ψ(0)0\psi'(0) \neq 0.
  • A4: Ψ(θ;𝛍)0\Psi'(\theta^\star; \boldsymbol{\mu}) \neq 0.

A1 allows the Delta method to be applied to g(𝐘)g(\boldsymbol{Y}). A2 ensures that θ0\theta_0 is a solution to E𝐘[Ψ(θ;𝐘)]E_\boldsymbol{Y}[\Psi(\theta; \boldsymbol{Y})] under the null hypothesis. A3 and continuity (A1) imply that the population estimating equation is monotone around θ0\theta_0 and hence that θ0\theta_0 is a locally unique solution. A4 is required by the implicit function theorem. Under the null hypothesis, A3 implies A4, but in general the two assumptions are distinct.

Applying the Delta method gives (using A1)

n(θ̃θ)pN(0,g(𝛍)Σg(𝛍)).\sqrt{n}(\tilde\theta-\theta^\star) \overset{p}{\rightarrow} N\left(0,\ \nabla g(\boldsymbol{\mu})^\top \Sigma \; \nabla g(\boldsymbol{\mu})\right). The gradient of g(𝐘)g(\boldsymbol{Y}) is obtained from the implicit function theorem (using A4):

g(𝐘)=(Ψ(θ;𝐘)θ)1Ψ(θ;𝐘)𝐘.\nabla g(\boldsymbol{Y}) = -\left(\frac{\partial\Psi(\theta;\boldsymbol{Y})}{\partial\theta}\right)^{-1} \frac{\partial\Psi(\theta;\boldsymbol{Y})}{\partial \boldsymbol{Y}}.

Evaluating the partial derivatives gives

Ψ(θ;𝐘)θ=i=1mψ((Yiθ)/si)si2 \frac{\partial\Psi(\theta;\boldsymbol{Y})}{\partial\theta} = - \sum_{i=1}^m \frac{\psi'((Y_i - \theta) / s_i)}{s_i^2}

and Ψ(θ;𝐘)Yi=ψ((Yiθ)/si)si2.\frac{\partial\Psi(\theta;\boldsymbol{Y})}{\partial Y_i} = \frac{\psi'((Y_i - \theta) / s_i)}{s_i^2}.

Therefore the gradient g(𝐘)\nabla g(\boldsymbol{Y}) has elements

g(𝐘)Yi=ψ((Yiθ)/si)/si2j=1mψ((Yjθ)/sj)/sj2. \frac{\partial g(\boldsymbol{Y})}{\partial Y_i} = \frac{\psi'((Y_i - \theta) / s_i) /s_i^2}{\sum_{j=1}^m \psi'((Y_j - \theta) / s_j) / s_j^2}.

The foregoing results provide the general (i.e., non-null) distribution of θ̃\tilde\theta.

Next we consider the null distribution. First we show that θ=θ0\theta^\star = \theta_0 and then derive V0(θ̃)V_0(\tilde\theta). The null hypothesis is that 𝛍=θ0𝟏\boldsymbol{\mu} = \theta_0 \boldsymbol{1}. This implies that the standardized residuals U0i=(Yiθ0)/siU_{0i} =(Y_i - \theta_0) /s_i are symmetrically distributed about zero. Combined with the assumption that ψ(u)\psi(u) is odd (A2), this gives θ=θ0\theta^\star = \theta_0 by the following argument:

E𝐘[Ψ(θ0;𝐘)]=iE𝐘[ψ(U0i)]=iE𝐘[ψ(U0i)]=iE𝐘[ψ(U0i)]=E𝐘[Ψ(θ0;𝐘)]. E_\boldsymbol{Y}[\Psi(\theta_0; \boldsymbol{Y})] = \sum_i E_\boldsymbol{Y}[\psi(U_{0i})] = \sum_i E_\boldsymbol{Y}[\psi(-U_{0i})] = \sum_i E_\boldsymbol{Y}[-\psi(U_{0i})] = - E_\boldsymbol{Y}[\Psi(\theta_0; \boldsymbol{Y})]. The second equality follows from the symmetry of U0iU_{0i} about zero and the third from A2. The chain of equalities shows that E𝐘[Ψ(θ0;𝐘)]=0E_\boldsymbol{Y}[\Psi(\theta_0; \boldsymbol{Y})] = 0. Together A1 and A3 ensure that this is a locally unique solution.

To obtain the null variance, we evaluate the gradient at 𝐘=𝛍=θ0𝟏\boldsymbol{Y} = \boldsymbol{\mu} = \theta_0 \boldsymbol{1}:

g(𝐘)Yi|𝐘=θ0𝟏=ψ((θ0θ0)/si)/si2j=1mψ((θ0θ0)/sj)/sj2=1/si2j=1m1/sj2.\left. \frac{\partial g(\boldsymbol{Y})}{\partial Y_i} \right|_{\boldsymbol{Y} = \theta_0\boldsymbol{1}} = \frac{\psi'((\theta_0 - \theta_0) / s_i) /s_i^2}{\sum_{j=1}^m \psi'((\theta_0 - \theta_0) / s_j) / s_j^2} = \frac{1 /s_i^2}{\sum_{j=1}^m 1 / s_j^2}.

The second equality follows from A3, which implies that ψ(0)\psi'(0) is equal to a non-zero constant that factors out of the numerator and denominator.

Finally, using si=V0(Yi)s_i = \sqrt{V_0(Y_i)} the gradient g(𝛍)\nabla g(\boldsymbol{\mu}) becomes a vector of precision weights. Letting 𝐩=(p1,,pn)\boldsymbol{p} = (p_1, \dots, p_n)^{\top} denote the vector of precision weights, we can write the asymptotic null distribution of θ̃\tilde\theta as

n(θ̃θ0)pN(0,𝐩Σ0𝐩).\sqrt{n}(\tilde\theta-\theta_0) \overset{p}{\rightarrow} N\left(0,\boldsymbol{p}^\top \Sigma_0 \; \boldsymbol{p}\right).

Under the additional assumption that Σ0\Sigma_0 is diagonal, the resulting expression for the null variance of θ̃\tilde \theta is

V0(θ̃)=i=1mpi2V0(Yi)=i=1m(1/V0(Yi)j=1m1/V0(Yi))2V0(Yi)=(j=1m1/V0(Yi))1. V_0(\tilde \theta) = \sum_{i=1}^{m}p_i^2 \, V_0(Y_i) = \sum_{i=1}^{m} \left(\frac{1/V_0(Y_i)}{\sum_{j=1}^{m}1/V_0(Y_i)}\right)^2 V_0(Y_i) = \left(\sum_{j=1}^{m} 1/V_0(Y_i)\right)^{-1}.

This is the result given in part (a) of Theorem 1 in Halpin (2022).

A similar argument gives the asymptotic null distribution of Yiθ̃Y_i - \tilde \theta as

n(Yiθ̃)pN(0,(𝐞i𝐩)Σ0(𝐞i𝐩)) \sqrt{n}(Y_ i - \tilde \theta) \overset{p}{\rightarrow} N \left(0,\ (\boldsymbol{e}_i - \boldsymbol{p})^\top \Sigma_0 \; (\boldsymbol{e}_i - \boldsymbol{p})\right) where 𝐞i\boldsymbol{e}_i is the ii-th column of the identity matrix. Under the additional assumption that Σ0\Sigma_0 is diagonal, the resulting expression for the variance is

V0(Yiθ̃)=V0(Yi)+V0(θ̃)2piV0(Yi)=V0(Yi)+V0(θ̃)2V0(θ̃)=V0(Yi)V0(θ̃)V_0(Y_i - \tilde \theta) = V_0(Y_i) + V_0(\tilde\theta) - 2 p_i V_0(Y_i) = V_0(Y_i) + V_0(\tilde \theta) - 2V_0(\tilde\theta) = V_0(Y_i) - V_0(\tilde\theta) This is the result given in part (b) of Theorem 1 in Halpin (2022).

Halpin (2022) used the same overall approach to compare two esimates of θ\theta – the unweighted mean of YiY_i and the robust estimator outlined above.

Implementation via IRLS

The robust DIF estimator θ̃\tilde\theta can be computed using IRLS based on the weighted mean definition given above. The IRLS algorithm is as follows:

  1. Initialize θ(0)\theta^{(0)} and set iteration counter t=0t = 0.
  2. Compute standardized residuals ui(t)=(Yiθ(t))/siu_i^{(t)} = (Y_i - \theta^{(t)}) / s_i.
  3. Compute weights wi(θ(t))=ψ(ui(t))/ui(t)w_i(\theta^{(t)}) = \psi(u_i^{(t)}) / u_i^{(t)} with wi*(0)1w^*_i(0) \equiv 1.
  4. Update θ(t+1)=i=1mwi(θ(t))Yi\theta^{(t+1)} = \sum_{i = 1}^{m} w_i(\theta^{(t)})\, Y_i with

wi(θ(t))=wi*(θ(t))/si2j=1mwj*(θ(t))/sj2w_i(\theta^{(t)}) = \frac{w^*_i(\theta^{(t)}) / s_i^2}{\sum_{j=1}^{m} w^*_j(\theta^{(t)}) / s_j^2}.

  1. If |θ(t+1)θ(t)|<ϵ|\theta^{(t+1)} - \theta^{(t)}| < \epsilon stop; else set t=t+1t = t + 1 and return to step 2.

Once the algorithm has converged, set θ̃=θ(t+1)\tilde\theta = \theta^{(t+1)}. Item-level DIF tests can be conducted using the Wald test statistic given above or the multi-parameter variant given in Halpin (2022).

References

Halpin, P.F. (2022) Differential Item Functioning Via Robust Scaling. Arxiv Preprint. https://arxiv.org/abs/2207.04598. Published in Psychometrika in 2024 under the same title.

Halpin, P.F. (2024) Differential Test Functioning Via Robust Scaling. Arxiv Preprint. https://arxiv.org/abs/2409.03502.

Halpin, P. F., & Gilbert, J. (2025). Testing Whether Reported Treatment Effects Are Unduly Influenced by Item-Level Heterogeneity. PsyArxiv Preprint. https://doi.org/10.31234/osf.io/9ru45_v1