robustDIF Technical Notes

These notes outline an updated version of the robust DIF procedure described in Halpin (2022) that is now implemented by the robustDIF package.

Assume two groups of respondents with sample sizes $n_1$ and $n_2$ and let $n = n_1 + n_2$. Also let $\boldsymbol{Y} = (Y_1,\dots,Y_m)^\top$ denote item-level statistics derived from the parameter estimates of items $i = 1, \dots, m$. The asymptotic arguments presented below assume that $n_1$ and $n_2$ go to infinity at the same rate but that the number of items $m$ remains fixed and finite.

The $Y_i$ are chosen so that $\sqrt{n}(\boldsymbol{Y} - \theta_0 \mathbf 1)\overset{d}\rightarrow N(0, \Sigma_0)$ under the null hypothesis of no DIF on any item. In this set-up, DIF means that $Y_i$ converges in probability to a fixed value other than $\theta_0$ . Some specific choices of $Y_i$ are detailed in Halpin (2024, 2025) and Halpin & Gilbert (2025). The notation $V_0(\boldsymbol{Y}) = \Sigma_0/n$ is used to denote the finite sample variance of $\boldsymbol{Y}$ , and similarly for other statistics.

The robust DIF procedure can be seen as solving two interrelated problems. First, it provides an M-estimator of $\theta_0$ that is highly robust to DIF. Second, it provides a procedure for flagging items with DIF, which happens automatically as a by-product of estimating $\theta_0$ . Standard Wald tests of DIF are also available.

Halpin (2022) used the (unstated) assumption that $\Sigma_0$ is diagonal and attempted to combine efficiency and robustness in a way that did not clearly separate these two opposing considerations. These notes address these shortcomings and implement a new, simpler and more general, version of the robust DIF procedure. The analytical details are stated so that they apply to any bounded, redescending loss function, although for computation the focus is on Tukey’s bisquare. Differences with Halpin (2022) are pointed out as they arise.

The Updated Robust DIF Procedure

Defining the Estimator

The robust estimator $\tilde\theta$ can be defined in three related ways. Let $u_i = u_i(\theta) = (Y_i - \theta)/{s_i}$ where $s_i>0$ are item-specific scaling factors to be chosen subsequently. The three definitions of $\tilde\theta$ are as follows.

1. The minimizing argument of the loss function: $R(\theta) = \sum_{i=1}^m \rho(u_i(\theta)).$ This definition is useful for deriving results about the robustness of $\tilde\theta$. For redescending loss functions there exist constants $c$ and $k$ such that $\rho(u_i) = c$ whenever $|u_i| \geq k$. It is usual to scale $\rho$ so that $c = 1$. The constant $k$ is treated as a tuning parameter that serves to identify outliers (i.e., items with DIF).
2. The solution to the estimating equation: $\Psi(\theta) = \sum_{i=1}^m \psi(u_i(\theta))/s_i = 0,$ where $\psi(u) = \rho'(u)$. The influence function $\Psi(\theta)$ is important for obtaining the variance of $\tilde\theta$.
3. A weighted mean that is obtained by defining the weights $w^*_i(\theta) = \psi(u_i(\theta)) / u_i(\theta)$ and substituting these into the estimating equation to get: $\theta = \sum_{i = 1}^{m} w_i(\theta)\, Y_i \qquad \text{with} \qquad w_i(\theta) = \frac{w^*_i(\theta) / s_i^2}{\sum_{j=1}^{m} w^*_j(\theta) / s_j^2}.$ By convention, $w^*_i(\theta) \equiv 1$ when $u_i(\theta) = 0$, to avoid division by zero. Also note that, when $|u_i| \geq k$, $\psi(u_i) = w^*_i(\theta) = 0$, so that outliers (as defined by $k$) are “redescended” to zero. The weighted mean is useful for computation via iteratively re-weighted least squares (IRLS). Especially for redescending loss functions, IRLS can be much more stable than Newton-based methods that solve $\Psi(\theta) = 0$ (because $\Psi'(\theta)$ can approach zero, leading the Newton steps to diverge).
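For concreteness, the three ingredients can be sketched in Python for Tukey's bisquare. This is a minimal illustration rather than the package's implementation: the function names are hypothetical, and the closed forms assume the standard bisquare with $\rho$ scaled so that $c = 1$. The weight function is written up to the constant $6/k^2$, which cancels in the normalized weights $w_i$ and makes the weight equal to 1 at $u = 0$.

```python
import numpy as np

def bisquare_rho(u, k):
    """Bisquare loss, scaled so that rho(u) = 1 whenever |u| >= k."""
    u = np.asarray(u, dtype=float)
    core = 1.0 - (1.0 - (u / k) ** 2) ** 3
    return np.where(np.abs(u) < k, core, 1.0)

def bisquare_psi(u, k):
    """Derivative of bisquare_rho; exactly zero whenever |u| >= k."""
    u = np.asarray(u, dtype=float)
    core = (6.0 * u / k**2) * (1.0 - (u / k) ** 2) ** 2
    return np.where(np.abs(u) < k, core, 0.0)

def bisquare_weight(u, k):
    """psi(u) / u divided by the constant 6 / k^2 (assumed rescaling).

    The closed form avoids dividing by zero and equals 1 at u = 0.
    """
    u = np.asarray(u, dtype=float)
    core = (1.0 - (u / k) ** 2) ** 2
    return np.where(np.abs(u) < k, core, 0.0)
```

Because only the normalized weights enter the weighted mean, dropping the constant $6/k^2$ does not change the resulting estimate.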

Choosing the Scaling Factors $s_i$

The scaling factors $s_i$ are required to ensure that $\tilde\theta$ is equivariant under re-scaling of the $Y_i$ . In conventional applications, the $Y_i$ are “raw” data points and the scaling factors $s_i = s$ are chosen to be an ancillary estimate of the scale of the $Y_i$ (e.g., the median absolute deviation or MAD). In this situation, the scaling factors are constant over $i$ , so they factor out of the estimating equation and cancel out in the numerator and denominator of the normalized weights $w_i$ .

In the present application, item-specific scaling factors $s_i$ are available because we can derive $\Sigma_0$ (the asymptotic covariance matrix of the $Y_i$ under the null hypothesis of no DIF) by applying the Delta method to the item parameter estimates. As shown above, this somewhat complicates the relationship among the different definitions of $\tilde\theta$ , because it is now important to keep track of how the item-specific scaling factors appear in the different definitions. However, the item-specific scaling factors are worth this additional complication for the following three reasons, all of which were noted in Halpin (2022).

First, obtaining item-specific scaling factors $s_i$ analytically from $\Sigma_0$ means that we no longer require an ancillary estimate based on the scale of the realized values of $Y_i$. This is important because it leads the resulting estimator to be highly robust to DIF. This is also the main detail that separates the proposed approach from that considered (and dismissed) by Stocking and Lord (1984). Stocking and Lord used $s = \mathrm{MAD}(Y_i)$. However, the MAD has a breakdown point of 1/4, so any estimator of $\theta$ that uses $s = \mathrm{MAD}(Y_i)$ will break down if $\geq 1/4$ of the items exhibit DIF (see Huber and Ronchetti, 2009, chapter 6). By contrast, $\Sigma_0$ does not depend directly on the scale of the realized values of $Y_i$; it can be computed directly from the item parameters. The overall result is that the robustness of the resulting M-estimator no longer depends on that of an ancillary scaling factor.
Second, the value of $Y_i$ appears in the expression for $V_0(Y_i)$. Thus, $\widehat{V}_0(Y_i)$ may yet be contaminated by DIF, leading to the potential for “masking”. Although this problem is not as severe as when using an ancillary estimate like the MAD, it is still a potential concern. This problem can be avoided as follows. The null hypothesis that the item does not exhibit DIF gives $Y_i \overset{p}\rightarrow\theta_0$. This motivates using the substitution $Y_i = \theta^\star$ when estimating $\Sigma_0$, where $\theta^\star$ is a consistent, high-breakdown estimate of $\theta$ (e.g., the median). The overall result is a plug-in estimator of ${V}_0(Y_i)$ that is robust to DIF.
Third, using item-specific scaling factors implies that we can downweight items with DIF at the desired asymptotic false positive rate during IRLS-based estimation. For example, if we choose $s_i = \sqrt{V_0(Y_i)}$ and $k$ as the $1-\alpha/2$ quantile of $N(0, 1)$, then items are down-weighted to zero once $Y_i$ falls outside the $(1-\alpha) \times 100\%$ confidence interval (CI) centered at $\theta$. In this way, DIF detection arises as a by-product of robust scaling.
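The CI rationale for the tuning parameter can be illustrated with a short Python sketch. The function names and the toy inputs are hypothetical and are not taken from the robustDIF package; the sketch only assumes that $s_i = \sqrt{V_0(Y_i)}$ and that $k$ is the $1-\alpha/2$ standard normal quantile, as described above.

```python
from statistics import NormalDist

def tuning_parameter(alpha):
    """k chosen as the 1 - alpha/2 quantile of N(0, 1)."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

def flag_items(y, v0, theta, alpha=0.05):
    """Flag item i when |Y_i - theta| / sqrt(V_0(Y_i)) >= k.

    These are exactly the items whose bisquare weight is zero.
    """
    k = tuning_parameter(alpha)
    return [abs(yi - theta) / vi ** 0.5 >= k for yi, vi in zip(y, v0)]

# Toy values (assumed): two items with the same null variance.
flags = flag_items(y=[0.1, 1.0], v0=[0.04, 0.04], theta=0.0)
```

With $\alpha = 0.05$ the tuning parameter is approximately 1.96, so the first toy item (standardized residual 0.5) is retained and the second (standardized residual 5) is flagged.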

Halpin (2022) chose $s_i = V_0(Y_i)$ based on a (flawed) argument about efficiency in the absence of DIF, which also complicated the choice of the tuning parameter $k$ . The argument was flawed because (a) it was based on the unstated assumption that $\Sigma_0$ is diagonal, which is not true for many IRT estimators, and (b) it did not account for the scaling factors that appear outside the influence function in the estimating equation (see point 2 in the previous section).

To address these issues, the robust DIF estimator has been updated to use $s_i = \sqrt{V_0(Y_i)}$ with tuning parameter $k$ based on the asymptotic CI rationale outlined above. This approach ignores sampling variation in $\tilde\theta$ when downweighting items with DIF. A more accurate downweighting procedure could instead be based on $s_i = \sqrt{V_0(Y_i - \tilde\theta)}$. Since $Y_i$ and $\tilde\theta$ are positively correlated, $V_0(Y_i - \tilde\theta) \leq V_0(Y_i)$. Thus, the flagging procedure based on $V_0(Y_i)$ overstates the scale of $Y_i - \tilde\theta$ and is somewhat conservative. To address this issue, it is recommended to compute item-by-item tests of DIF using a standard Wald test following estimation of $\tilde\theta$:

$z_i = \frac{Y_i - \tilde\theta}{\sqrt{V_0(Y_i - \tilde\theta)}}.$

Note that modifying the estimator $\tilde\theta$ to instead use $s_i = \sqrt{V_0(Y_i - \tilde\theta)}$ is possible (see Halpin), but this leads to complications in obtaining its asymptotic distribution. In practice, these complications do not appear to be worth the trouble, as there is little change in finite sample performance when using the simpler approach outlined above.

The Asymptotic Distribution of $\tilde\theta$

Halpin (2022) obtained the asymptotic distribution of $\tilde\theta$ using the Delta method and the implicit function theorem. The derivation is recapitulated here for general (i.e., non-diagonal) $\Sigma_0$ , and the results presented in Halpin (2022) are seen to follow from the assumption that $\Sigma_0$ is diagonal.

The estimator $\tilde\theta = g(\boldsymbol{Y})$ is implicitly defined as the solution to the estimating equation

$\Psi(\theta; \boldsymbol{Y}) = \sum_{i=1}^m \frac{\psi\left((Y_i-\theta)/s_i\right)}{s_i} = 0.$

Let the asymptotic distribution of $\boldsymbol{Y}$ be denoted as

$\sqrt{n}(\boldsymbol{Y}-\boldsymbol{\mu}) \overset{d}{\rightarrow} N(0,\Sigma)$ with the null hypothesis of no DIF leading to $\boldsymbol{\mu} = \theta_0\boldsymbol{1}$ and $\Sigma = \Sigma_0$. Also let $\theta^\star$ be defined as any solution to the population estimating equation $E_{\boldsymbol{Y}}[\Psi(\theta; \boldsymbol{Y})]= 0$. There may be multiple local solutions when using a redescending loss function, and the asymptotic results described here apply to any local solution. In practice, local minima can be diagnosed by plotting $R(\theta)$ over a grid of $\theta$ values.
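The grid-based diagnostic can be sketched as follows. This is an illustrative Python example with made-up data, not output from the package: the helper names are hypothetical, the loss is Tukey's bisquare scaled to a maximum of 1, and the toy $Y_i$ are deliberately split into two clusters so that $R(\theta)$ has more than one local minimum.

```python
import numpy as np

def bisquare_rho(u, k):
    """Bisquare loss, scaled so that rho(u) = 1 whenever |u| >= k."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) < k, 1.0 - (1.0 - (u / k) ** 2) ** 3, 1.0)

def loss_on_grid(y, s, k, grid):
    """Evaluate R(theta) = sum_i rho((Y_i - theta) / s_i) at each grid point."""
    y, s = np.asarray(y, dtype=float), np.asarray(s, dtype=float)
    u = (y[None, :] - grid[:, None]) / s[None, :]
    return bisquare_rho(u, k).sum(axis=1)

def local_minima(grid, loss):
    """Interior grid points lower than the left neighbour and no higher
    than the right neighbour; flat plateaus are not reported."""
    inner = np.arange(1, len(grid) - 1)
    is_min = (loss[inner] < loss[inner - 1]) & (loss[inner] <= loss[inner + 1])
    return grid[inner[is_min]]

# Toy data (assumed): four items near 0 and three items near 2.
y = np.array([0.00, 0.10, -0.10, 0.05, 2.00, 2.10, 1.95])
s = np.full(7, 0.20)
grid = np.linspace(-1.0, 3.0, 401)
loss = loss_on_grid(y, s, k=1.96, grid=grid)
minima = local_minima(grid, loss)
global_min = grid[np.argmin(loss)]
```

Here the global minimum sits near the larger cluster around 0, while the smaller cluster produces a second, shallower local minimum near 2. Plotting `loss` against `grid` shows both basins separated by a flat region where every item has been redescended.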

The following assumptions are used:

A1: $\psi(u)$ is continuously differentiable.
A2: $\psi(u)$ is odd (i.e. $\psi(-u) = -\psi(u)$ ).
A3: $\psi'(0) \neq 0$ .
A4: $\Psi'(\theta^\star; \boldsymbol{\mu}) \neq 0$ .

A1 allows the Delta method to be applied to $g(\boldsymbol{Y})$. A2 ensures that $\theta_0$ is a solution to $E_{\boldsymbol{Y}}[\Psi(\theta; \boldsymbol{Y})] = 0$ under the null hypothesis. A3 and continuity (A1) imply that the population estimating equation is monotone around $\theta_0$ and hence that $\theta_0$ is a locally unique solution. A4 is required by the implicit function theorem. Under the null hypothesis, A3 implies A4, but in general the two assumptions are distinct.

Applying the Delta method gives (using A1)

$\sqrt{n}(\tilde\theta-\theta^\star) \overset{d}{\rightarrow} N\left(0,\ \nabla g(\boldsymbol{\mu})^\top \Sigma \; \nabla g(\boldsymbol{\mu})\right).$ The gradient of $g(\boldsymbol{Y})$ is obtained from the implicit function theorem (using A4):

$\nabla g(\boldsymbol{Y}) = -\left(\frac{\partial\Psi(\theta;\boldsymbol{Y})}{\partial\theta}\right)^{-1} \frac{\partial\Psi(\theta;\boldsymbol{Y})}{\partial \boldsymbol{Y}}.$

Evaluating the partial derivatives gives

$\frac{\partial\Psi(\theta;\boldsymbol{Y})}{\partial\theta} = - \sum_{i=1}^m \frac{\psi'((Y_i - \theta) / s_i)}{s_i^2}$

and $\frac{\partial\Psi(\theta;\boldsymbol{Y})}{\partial Y_i} = \frac{\psi'((Y_i - \theta) / s_i)}{s_i^2}.$

Therefore the gradient $\nabla g(\boldsymbol{Y})$ has elements

$\frac{\partial g(\boldsymbol{Y})}{\partial Y_i} = \frac{\psi'((Y_i - \theta) / s_i) /s_i^2}{\sum_{j=1}^m \psi'((Y_j - \theta) / s_j) / s_j^2}.$

The foregoing results provide the general (i.e., non-null) distribution of $\tilde\theta$ .
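As a numerical sanity check on the gradient expression, the sketch below compares it with central finite differences of the implicitly defined $g(\boldsymbol{Y})$. Everything here is an illustrative assumption: the data are made up, the helper names are hypothetical, $\psi$ is the bisquare scaled so that $\psi'(0) = 1$ (the scaling cancels in the gradient), and $g$ is evaluated by running the weighted-mean fixed-point iteration from a fixed starting value.

```python
import numpy as np

def psi(u, k):
    """Bisquare psi, scaled so that psi'(0) = 1."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) < k, u * (1.0 - (u / k) ** 2) ** 2, 0.0)

def psi_prime(u, k):
    """Derivative of psi: (1 - x)(1 - 5x) with x = (u/k)^2, zero outside k."""
    u = np.asarray(u, dtype=float)
    x = (u / k) ** 2
    return np.where(np.abs(u) < k, (1.0 - x) * (1.0 - 5.0 * x), 0.0)

def solve_theta(y, s, k, start, n_iter=200):
    """Evaluate g(y) by iterating the weighted-mean form of Psi(theta; y) = 0."""
    theta = start
    for _ in range(n_iter):
        u = (y - theta) / s
        w = np.where(np.abs(u) < k, (1.0 - (u / k) ** 2) ** 2, 0.0) / s**2
        theta = np.sum(w * y) / np.sum(w)
    return theta

def analytic_gradient(y, s, k, theta):
    """Elements psi'(u_i)/s_i^2 divided by their sum, evaluated at theta = g(y)."""
    d = psi_prime((y - theta) / s, k) / s**2
    return d / d.sum()

def numeric_gradient(y, s, k, start, h=1e-5):
    """Central finite differences of g with respect to each Y_i."""
    grad = np.zeros_like(y)
    for i in range(len(y)):
        step = np.zeros_like(y)
        step[i] = h
        upper = solve_theta(y + step, s, k, start)
        lower = solve_theta(y - step, s, k, start)
        grad[i] = (upper - lower) / (2.0 * h)
    return grad

# Toy data (assumed): the last item is a clear outlier.
y = np.array([0.10, -0.05, 0.02, 0.15, 1.50])
s = np.array([0.20, 0.25, 0.15, 0.30, 0.20])
k, start = 1.96, 0.10
theta_hat = solve_theta(y, s, k, start)
residual = np.sum(psi((y - theta_hat) / s, k) / s)  # Psi at the solution
grad = analytic_gradient(y, s, k, theta_hat)
grad_fd = numeric_gradient(y, s, k, start)
```

The gradient elements sum to one, reflecting translation equivariance of $\tilde\theta$, and the element for the outlying item is exactly zero because $\psi'$ vanishes beyond $k$.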

Next we consider the null distribution. First we show that $\theta^\star = \theta_0$ and then derive $V_0(\tilde\theta)$ . The null hypothesis is that $\boldsymbol{\mu} = \theta_0 \boldsymbol{1}$ . This implies that the standardized residuals $U_{0i} =(Y_i - \theta_0) /s_i$ are symmetrically distributed about zero. Combined with the assumption that $\psi(u)$ is odd (A2), this gives $\theta^\star = \theta_0$ by the following argument:

$E_{\boldsymbol{Y}}[\Psi(\theta_0; \boldsymbol{Y})] = \sum_i \frac{E_{\boldsymbol{Y}}[\psi(U_{0i})]}{s_i} = \sum_i \frac{E_{\boldsymbol{Y}}[\psi(-U_{0i})]}{s_i} = \sum_i \frac{E_{\boldsymbol{Y}}[-\psi(U_{0i})]}{s_i} = - E_{\boldsymbol{Y}}[\Psi(\theta_0; \boldsymbol{Y})].$ The second equality follows from the symmetry of $U_{0i}$ about zero and the third from A2. The chain of equalities shows that $E_{\boldsymbol{Y}}[\Psi(\theta_0; \boldsymbol{Y})] = 0$. Together A1 and A3 ensure that this is a locally unique solution.

To obtain the null variance, we evaluate the gradient at $\boldsymbol{Y} = \boldsymbol{\mu} = \theta_0 \boldsymbol{1}$ :

$\left. \frac{\partial g(\boldsymbol{Y})}{\partial Y_i} \right|_{\boldsymbol{Y} = \theta_0\boldsymbol{1}} = \frac{\psi'((\theta_0 - \theta_0) / s_i) /s_i^2}{\sum_{j=1}^m \psi'((\theta_0 - \theta_0) / s_j) / s_j^2} = \frac{1 /s_i^2}{\sum_{j=1}^m 1 / s_j^2}.$

The second equality follows from A3, which implies that $\psi'(0)$ is equal to a non-zero constant that factors out of the numerator and denominator.

Finally, using $s_i = \sqrt{V_0(Y_i)}$ the gradient $\nabla g(\boldsymbol{\mu})$ becomes a vector of precision weights. Letting $\boldsymbol{p} = (p_1, \dots, p_m)^{\top}$ denote the vector of precision weights, we can write the asymptotic null distribution of $\tilde\theta$ as

$\sqrt{n}(\tilde\theta-\theta_0) \overset{d}{\rightarrow} N\left(0,\boldsymbol{p}^\top \Sigma_0 \; \boldsymbol{p}\right).$

Under the additional assumption that $\Sigma_0$ is diagonal, the resulting expression for the null variance of $\tilde \theta$ is

$V_0(\tilde \theta) = \sum_{i=1}^{m}p_i^2 \, V_0(Y_i) = \sum_{i=1}^{m} \left(\frac{1/V_0(Y_i)}{\sum_{j=1}^{m}1/V_0(Y_j)}\right)^2 V_0(Y_i) = \left(\sum_{j=1}^{m} 1/V_0(Y_j)\right)^{-1}.$

This is the result given in part (a) of Theorem 1 in Halpin (2022).
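The null variance can be computed directly from the quadratic form, which also makes it easy to see what the diagonal assumption buys. The following Python sketch uses hypothetical function names and made-up variances; its input is the finite-sample covariance matrix $\Sigma_0/n$, so the output is $V_0(\tilde\theta)$ rather than the asymptotic variance.

```python
import numpy as np

def precision_weights(v0):
    """p_i = (1 / V_0(Y_i)) / sum_j (1 / V_0(Y_j))."""
    precision = 1.0 / np.asarray(v0, dtype=float)
    return precision / precision.sum()

def null_variance(cov):
    """p' cov p, where cov is the finite-sample covariance Sigma_0 / n."""
    cov = np.asarray(cov, dtype=float)
    p = precision_weights(np.diag(cov))
    return float(p @ cov @ p)

# Toy variances (assumed) for three items.
v0 = np.array([0.04, 0.09, 0.01])
var_diagonal = null_variance(np.diag(v0))

# Same variances with a small positive covariance between every pair of items.
cov_correlated = np.diag(v0) + 0.005 * (1.0 - np.eye(3))
var_correlated = null_variance(cov_correlated)
```

In the diagonal case the quadratic form collapses to the inverse of the summed precisions. With positive covariances among the $Y_i$, the quadratic form is larger, so treating $\Sigma_0$ as diagonal would understate the variance of $\tilde\theta$ in this toy example.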

A similar argument gives the asymptotic null distribution of $Y_i - \tilde \theta$ as

$\sqrt{n}(Y_i - \tilde \theta) \overset{d}{\rightarrow} N \left(0,\ (\boldsymbol{e}_i - \boldsymbol{p})^\top \Sigma_0 \; (\boldsymbol{e}_i - \boldsymbol{p})\right)$ where $\boldsymbol{e}_i$ is the $i$-th column of the identity matrix. Under the additional assumption that $\Sigma_0$ is diagonal, the resulting expression for the variance is

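$V_0(Y_i - \tilde \theta) = V_0(Y_i) + V_0(\tilde\theta) - 2 p_i V_0(Y_i) = V_0(Y_i) + V_0(\tilde \theta) - 2V_0(\tilde\theta) = V_0(Y_i) - V_0(\tilde\theta).$ The second equality uses $p_i V_0(Y_i) = V_0(\tilde\theta)$, which follows from the definition of the precision weights and the expression for $V_0(\tilde\theta)$ given above. This is the result given in part (b) of Theorem 1 in Halpin (2022).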
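The Wald statistics $z_i$ can be sketched in Python using the general (non-diagonal) form of $V_0(Y_i - \tilde\theta)$. The function name, the toy data, and the value passed as `theta` are all assumptions for illustration; in practice `theta` would be the robust estimate $\tilde\theta$ and `cov` an estimate of $\Sigma_0/n$.

```python
import numpy as np

def wald_z(y, cov, theta):
    """z_i = (Y_i - theta) / sqrt((e_i - p)' cov (e_i - p))."""
    y = np.asarray(y, dtype=float)
    cov = np.asarray(cov, dtype=float)
    precision = 1.0 / np.diag(cov)
    p = precision / precision.sum()
    # Row i of `contrasts` is the vector e_i - p.
    contrasts = np.eye(len(y)) - p
    # Quadratic form c_i' cov c_i for every row c_i at once.
    var_diff = np.einsum("ij,jk,ik->i", contrasts, cov, contrasts)
    return (y - theta) / np.sqrt(var_diff)

# Toy inputs (assumed): diagonal covariance, third item far from theta.
y = np.array([0.10, -0.05, 0.60])
v0 = np.array([0.04, 0.09, 0.01])
z = wald_z(y, np.diag(v0), theta=0.05)
```

With a diagonal covariance matrix, the denominator reduces to $\sqrt{V_0(Y_i) - V_0(\tilde\theta)}$, in agreement with the expression above.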

Halpin (2022) used the same overall approach to compare two estimates of $\theta$: the unweighted mean of $Y_i$ and the robust estimator outlined above.

Implementation via IRLS

The robust DIF estimator $\tilde\theta$ can be computed using IRLS based on the weighted mean definition given above. The IRLS algorithm is as follows:

1. Initialize $\theta^{(0)}$ and set iteration counter $t = 0$.
2. Compute standardized residuals $u_i^{(t)} = (Y_i - \theta^{(t)}) / s_i$.
3. Compute weights $w^*_i(\theta^{(t)}) = \psi(u_i^{(t)}) / u_i^{(t)}$, with $w^*_i(\theta^{(t)}) \equiv 1$ when $u_i^{(t)} = 0$.
4. Update $\theta^{(t+1)} = \sum_{i = 1}^{m} w_i(\theta^{(t)})\, Y_i$ with

$w_i(\theta^{(t)}) = \frac{w^*_i(\theta^{(t)}) / s_i^2}{\sum_{j=1}^{m} w^*_j(\theta^{(t)}) / s_j^2}$.

5. If $|\theta^{(t+1)} - \theta^{(t)}| < \epsilon$ stop; else set $t = t + 1$ and return to step 2.

Once the algorithm has converged, set $\tilde\theta = \theta^{(t+1)}$ . Item-level DIF tests can be conducted using the Wald test statistic given above or the multi-parameter variant given in Halpin (2022).
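The IRLS steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the robustDIF implementation: the function name is hypothetical, the starting value is taken to be the median of the $Y_i$, the weights use Tukey's bisquare in closed form (scaled so that the weight at $u = 0$ is 1), and the error raised when every item is redescended is a design choice of this sketch.

```python
import numpy as np

def irls_robust_mean(y, s, k=1.96, tol=1e-10, max_iter=100):
    """IRLS for the bisquare M-estimator with item-specific scales s_i.

    Returns the estimate and the normalized weights from the final update;
    items with weight zero are the ones flagged as outliers.
    """
    y = np.asarray(y, dtype=float)
    s = np.asarray(s, dtype=float)
    theta = float(np.median(y))  # assumed high-breakdown starting value
    for _ in range(max_iter):
        u = (y - theta) / s                          # standardized residuals
        inside = np.abs(u) < k
        w_star = np.where(inside, (1.0 - (u / k) ** 2) ** 2, 0.0)
        total = np.sum(w_star / s**2)
        if total == 0.0:
            raise ValueError("every item has zero weight; try another start or k")
        w = (w_star / s**2) / total                  # normalized weights
        theta_new = float(np.sum(w * y))             # weighted-mean update
        converged = abs(theta_new - theta) < tol
        theta = theta_new
        if converged:
            break
    return theta, w

# Toy data (assumed): the last item is far from the rest.
y = np.array([0.10, -0.05, 0.02, 0.15, 1.50])
s = np.array([0.20, 0.25, 0.15, 0.30, 0.20])
theta_tilde, weights = irls_robust_mean(y, s)
flagged = weights == 0.0
```

In this toy example only the last item receives zero weight, so it is the only item flagged, and the estimate is a weighted mean of the remaining four items.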

References

Halpin, P.F. (2022) Differential Item Functioning Via Robust Scaling. Arxiv Preprint. https://arxiv.org/abs/2207.04598. Published in Psychometrika in 2024 under the same title.

Halpin, P.F. (2024) Differential Test Functioning Via Robust Scaling. Arxiv Preprint. https://arxiv.org/abs/2409.03502.

Halpin, P. F., & Gilbert, J. (2025). Testing Whether Reported Treatment Effects Are Unduly Influenced by Item-Level Heterogeneity. PsyArxiv Preprint. https://doi.org/10.31234/osf.io/9ru45_v1

Peter F Halpin

2026-04-23
