Part 1: Measurement Invariance

Peter F. Halpin

Overview of Workshop

  • Part 1. Intro + factor analysis + MI
  • Part 2. IRT + DIF
  • Part 3. Robust scaling + DIF + DTF
  • Abbreviations

    • DIF = differential item functioning
    • DTF = differential test functioning
    • IRT = item response theory
    • MI = measurement invariance

Overview of Part 1

  • General definitions of MI and DIF
  • Factor model for categorical data
  • “Levels” of MI defined for categorical data
  • Testing MI by comparing models (chi-square difference tests)
  • Worked example

Organization

MI and DIF in General

  • First goal: Define the main issues

Intuitive definitions (Bauer, 2017)

  • Measurement invariance (MI): An assessment performs the same across different groups of respondents

  • Differential item functioning (DIF): An item performs differently across different groups of respondents

  • Relation:

    • If MI holds, no items exhibit DIF
    • If MI does not hold, at least one item exhibits DIF

Examples (Curley & Schmitt, 1993)

A general framework

  • To discuss MI / DIF in general, represent psychometric models using definition of marginal distribution

\[p(\mathbf{x}) = \int p(\mathbf{x} \mid \eta) \, p(\eta) \, d\eta\]

  • \(\mathbf{x} = [X_1, X_2, \dots, X_J]\): observed variables (assessment items)
  • \(\eta\): latent variable (trait, factor, construct)
  • \(p\): probability distribution (mass, density)

A general framework

  • Different assumptions define different models (e.g., Holland & Rosenbaum, 1986)

\[p(\mathbf{x}) = \int p(\mathbf{x} \mid \eta) \, p(\eta) \, d\eta\]

  • Factor analysis: \(\mathbf{x}\) and \(\eta\) are normally distributed with \(\mathbf{x} = \mathbf{\nu} + \Lambda \eta + \mathbf{\epsilon}\)
  • IRT: \(\mathbf{x}\) multinomial and \(\eta\) is normal
  • ….

A general framework

  • This representation shows that there are two parts in a psychometric model

\[p(\mathbf{x}) = \int {p(\mathbf{x} \mid \eta)} \, {p(\eta)} \, d\eta\]

  • \({\text{The measurement model: } p(\mathbf{x} \mid \eta)}\)
  • \({\text{The population model: } p(\eta) }\)
  • Helpful for understanding MI / DIF

The measurement model

  • The conditional distribution \(p(\mathbf{x} \mid \eta)\) relates the observed data to the latent trait

  • Typically assume conditional independence

\[p(\mathbf{x} \mid \eta) = \prod_j p(X_j \mid \eta) \]

  • The correlations among the items are explained by the latent trait only
    • “The items measure the construct”
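Conditional independence can be illustrated with a small simulation (hypothetical data, not from the workshop): two binary items driven by the same latent trait are correlated marginally, but nearly uncorrelated once we hold the trait approximately constant.

```r
# Two binary items that depend on eta only (hypothetical parameters)
set.seed(1)
n   <- 50000
eta <- rnorm(n)                          # latent trait
x1  <- rbinom(n, 1, pnorm(1.2 * eta))   # item 1
x2  <- rbinom(n, 1, pnorm(1.2 * eta))   # item 2

cor(x1, x2)                              # marginal correlation: clearly positive
band <- abs(eta) < 0.1                   # condition on eta near 0
cor(x1[band], x2[band])                  # conditional correlation: near zero
```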

The population model

  • The distribution of the latent trait \(p(\eta)\) describes how the target construct is distributed

  • Psychometric models require that we set the scale of the latent trait

    • e.g., set \(E[\eta] = 0\) and \(V[\eta] = 1\)
  • This will turn out to be a complicated aspect of MI / DIF

    • Different levels of MI allow for different parameters of the population to be estimated

MI

  • Let \(W\) be any other variable
    • Often gender, race, but could be anything

\[p(\mathbf{x} \mid {W}) = \int \prod_j p(X_j \mid \eta, {W}) \, p(\eta \mid {W}) \, d\eta\]

  • MI: \(p(X_j \mid \eta, {W}) = p(X_j \mid \eta)\) for all \(j\)
  • Measurement model does not depend on \(W\)
  • “The measure is not biased with respect to \(W\)”

Implications of MI

  • The marginal model with MI:

\[p(\mathbf{x} \mid {W}) = \int \prod_j p(X_j \mid \eta) \, p(\eta \mid {W}) \, d\eta\]

  • If groups differ in their observed scores, this must be because they differ on the latent trait

\[ p(\mathbf{x} \mid W) \neq p(\mathbf{x}) \rightarrow p(\eta \mid W) \neq p(\eta)\]

DIF

\[p(\mathbf{x} \mid {W}) = \int \prod_j p(X_j \mid \eta, {W}) \, p(\eta \mid {W}) \, d\eta\]

  • DIF: \(p(X_j \mid \eta, {W}) \neq p(X_j \mid \eta)\) for item \(j\)
    • Measurement model does depend on \(W\) for some items
    • This is just the opposite of MI
    • Sometimes called “measurement bias”

Implications of DIF

\[p(\mathbf{x} \mid {W}) = \int \prod_j p(X_j \mid \eta, {W}) \, p(\eta \mid {W}) \, d\eta\]

  • If groups differ in their observed scores, this could be because:
      1. the population model differs over groups
      2. the measurement model differs over groups
      3. both

Why is DIF a problem?

\[p(\mathbf{x} \mid {W}) = \int \prod_j p(X_j \mid \eta, {W}) \, p(\eta \mid {W}) \, d\eta\]

  • For technical reasons, we cannot estimate this model if all items exhibit DIF
    • “Circular nature of DIF”
    • Will discuss more in Part 2
    • For now, just the implications…

Why is DIF a problem?

  • When we compute and report test scores using \(\mathbf{x}\), we are implicitly assuming that items do not exhibit DIF (i.e., are not biased)

  • If this assumption is mistaken:

    • Individuals’ test scores may be biased
    • Estimates of group differences based on observed test scores may be biased
    • Estimates of impact using latent variable models may be biased

What about partial MI?

  • Partial MI means that some but not all items exhibit DIF
    • Keeping biased items can be OK in some research settings
  • But the usual goal of DIF analysis is to remove any items with DIF
    • i.e., the goal is full MI, not partial MI
    • This is still the standard approach in test development – remove items with DIF before reporting scores

What about different models?

  • We have just seen how to define MI / DIF in general

  • However

    • MI was developed in the factor analysis literature
    • DIF was developed in the IRT literature
    • See Thissen (2023) for a historical review

What about different models?

  • Models, tests, conventions, and software for MI and DIF differ due to historical reasons

  • We can approach both MI and DIF using either model, but it is currently easier to go with the traditional distinctions

    • Factor analysis software makes testing MI easy!
    • IRT software makes testing DIF easy!
    • You can switch this up, but it requires (a bit) more work

Broad comparison between models

Feature                          Factor analysis                    IRT
Dimensionality of latent trait   Multidimensional                   Unidimensional (traditionally!)
Treatment of categorical data    Latent response variables          Item response functions
Model estimation                 Polychoric correlations (WLS)      Maximum likelihood
Model parameterization           General, many “extra” parameters   Specific, only parameters used in a given model
Main visualization               Path diagram                       Item response functions

Summary

  • MI / DIF are about the measurement model
    • We want to make sure measurement does not depend on, e.g., a person’s gender
  • Impact is about the population model
    • There may or may not be group differences on the target construct
  • Without (partial) MI, we cannot know if observed differences are due to measurement bias, “true” differences on the target construct, or both

Factor Analysis for Categorical Data

  • Focus on unidimensional models

Factor model

  • For continuous observed variables

\[ X_j = \nu_j + \lambda_j \eta + \epsilon_j \]

  • Assumptions
    • \(\eta \sim N(\kappa, \phi) \quad \; \; \,\text {and} \quad \epsilon_j \sim N(0, \theta_j)\)
    • \(\text{cov}[\epsilon_j, \eta] = 0 \quad \text {and} \quad \text{cov}[\epsilon_j, \epsilon_{k}] = 0\)
  • Doesn’t work when \(X_j\) is categorical!
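The continuous model can be sketched with simulated data (hypothetical values for \(\lambda_j\) and \(\theta_j\)); under the assumptions above, the implied covariance between items \(j\) and \(k\) is \(\lambda_j \lambda_k \phi\).

```r
# Simulate X_j = nu_j + lambda_j * eta + eps_j with nu_j = 0, eta ~ N(0, 1)
set.seed(2)
n      <- 100000
lambda <- c(0.8, 0.6, 0.7)               # hypothetical loadings
theta  <- c(0.36, 0.64, 0.51)            # residual variances (unit item variances)
eta    <- rnorm(n)                       # kappa = 0, phi = 1
X      <- sapply(1:3, function(j)
  lambda[j] * eta + rnorm(n, 0, sqrt(theta[j])))

cov(X[, 1], X[, 2])                      # ~ lambda_1 * lambda_2 = 0.8 * 0.6 = 0.48
```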

Latent response variables (LRVs)

  • Factor analysis deals with categorical data by introducing a new type of latent variable

  • For each categorical observed variable \(X_j\), we assume there exists a latent response variable \(X^*_j\)

    • Not a variable of substantive interest, just a mathematical convenience!!

    • Illustrations on following slides

LRVs: Why?

  • Benefit: can do factor analysis “as usual” with \(X^*_j\)

    • When life gives you lemons, make lemonade
    • When life gives you categorical data, make continuous data
  • Cost: introduces new variables \(X^*_j\) (and their parameters) that don’t mean anything

    • Will be a bit of a nuisance later on

LRVs: How They Work

Figure: Wirth & Edwards, 2007

Two main ideas

  • First idea: thresholding a latent response variable
    • Assumes LRVs are normally distributed
    • Allows us to deal with categorical data
  • Second idea: tetrachoric and polychoric correlations
    • Assumes pairs of LRVs are bivariate normal
    • Allows us to estimate factor model

Thresholding


Tetrachoric correlation

  • Correlation of observed responses: Phi-coefficient
  • Correlation of latent responses: Tetrachoric correlation

Polychoric correlation

  • Correlation of observed items: Spearman, …
  • Correlation of latent responses: Polychoric correlation
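A minimal sketch with simulated 3-category items (hypothetical data): lavaan’s lavCor() estimates the polychoric correlation of the LRVs, which recovers the latent correlation that the observed-score (Spearman) correlation attenuates.

```r
library(lavaan)

# Two 3-category items created by thresholding continuous LRVs
set.seed(3)
n    <- 2000
eta  <- rnorm(n)
lrv1 <- 0.7 * eta + rnorm(n, 0, sqrt(1 - 0.49))  # unit-variance LRVs
lrv2 <- 0.6 * eta + rnorm(n, 0, sqrt(1 - 0.36))
d <- data.frame(x1 = cut(lrv1, c(-Inf, -0.5, 0.5, Inf), labels = FALSE),
                x2 = cut(lrv2, c(-Inf, -0.5, 0.5, Inf), labels = FALSE))

lavCor(d, ordered = c("x1", "x2"))   # polychoric, ~ 0.7 * 0.6 = 0.42
cor(d$x1, d$x2, method = "spearman") # attenuated observed-score correlation
```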

Summary

  • LRVs are used to deal with categorical data in factor analysis
  • The correlations between the LRVs are modeled, rather than modeling the categorical data directly
    • These are called tetrachoric and polychoric correlations
  • All this is done for mathematical convenience!
    • LRVs don’t (usually) represent substantive concepts
    • They don’t show up in IRT, which is a main difference between models

Back to the Factor Model …

Factor model for categorical data

  • Step 1: Assume categorical variables \(X_j\) with \(c = 1, \dots, C\) categories arise from thresholding an LRV

\[ X_j = \left\{ \begin{array}{ccc} 1 & if & -\infty < X_j^* \leq \tau_{j1} \\ 2 & if & \tau_{j1} < X_j^* \leq \tau_{j2} \\ ... & & \\ C & if & \tau_{j,C-1} < X_j^* \leq \infty \end{array} \right.\]

  • The parameters \(\tau_j = [\tau_{j1}, \dots, \tau_{j,C-1}]\) are called the item thresholds
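Step 1 can be sketched directly in R with cut(), using hypothetical thresholds \(\tau_j = (-0.5, 0.8)\) and a standard-normal LRV (as in the Delta parameterization):

```r
# Thresholding a standard-normal LRV into a 3-category item
set.seed(4)
xstar <- rnorm(10000)                        # LRV, X*_j ~ N(0, 1)
tau   <- c(-0.5, 0.8)                        # hypothetical thresholds
x     <- cut(xstar, breaks = c(-Inf, tau, Inf), labels = FALSE)

table(x) / length(x)   # observed category proportions
pnorm(tau)             # model-implied cumulative proportions P(X_j <= c)
```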

Factor model for categorical data

  • Step 2: Factor model for the LRVs

\[ X^*_j = \nu_j + \lambda_j \eta + \epsilon_j \]

  • Assumptions
    • \(\eta \sim N(\kappa, \phi) \quad \; \; \,\text {and} \quad \epsilon_j \sim N(0, \theta_j)\)
    • \(\text{cov}[\epsilon_j, \eta] = 0 \quad \text {and} \quad \text{cov}[\epsilon_j, \epsilon_{k}] = 0\)
  • Same as continuous model, but now for the LRVs

Model identification

  • Model identification for categorical data is complicated

  • LRVs introduce a lot of parameters we cannot actually estimate

  • Short version: In the single-group case, the only parameters we can estimate are

    • the factor loadings \(\lambda_j\)
    • the thresholds \(\tau_j\)

Model identification

  • It gets more complicated when testing for MI

  • We can estimate some of the excluded parameters, but different authors use different approaches

  • So, to be prepared for MI, it helps to go through the long version of this problem for the single-group case …

Model identification

  • Standardize the latent trait as usual: \(\eta \sim N(0, 1)\)
    • e.g., set \(\kappa = 0\) and \(\phi = 1\)
    • Alternatively, can fix one factor loading to 1 and one intercept to 0
  • For the LRVs, we must also set their scale, and there are two ways to do this
    • “Delta parameterization” - recommended for interpretation, default in lavaan
    • “Theta parameterization” - can simplify estimation, won’t discuss much

Delta parameterization

  • Standardize the LRVs as \(X_j^* \sim N(0, 1)\)
    • i.e., set \(\mu_j = 0\) and \(\sigma^2_j = 1\)
    • “Delta” is defined as \(\Delta = 1 / \sigma_j\), so equivalent to setting \(\Delta = 1\)
  • Equivalent to standardizing continuous data
    • Factor loadings can be interpreted as correlations
    • Thresholds can be interpreted as z-scores

Implications of Delta parameterization

  • Setting \(\mu_j = 0\) implies the intercepts are also zero:

\[\mu_j = 0 = \nu_j + \lambda_j \kappa = \nu_j + \lambda_j (0) \]

  • So, \(\nu_j = 0\)

  • Implication: the intercepts of LRVs \(\nu_j\) cannot be estimated (fixed to zero)

Implications of Delta parameterization

  • Setting \(\sigma^2_j = 1\) implies the value of the residual variances

\[\sigma^2_j = 1 = \lambda_j^2 \phi + \theta_j = \lambda_j^2 (1) + \theta_j\]

  • So \(\theta_j = 1 - \lambda_j^2\)

  • Implication: the residual variance of the factor model cannot be estimated (fixed to \(1 - \lambda_j^2\))
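As a quick numeric check of these constraints: if, say, \(\lambda_j = 0.6\), then

\[\theta_j = 1 - \lambda_j^2 = 1 - 0.36 = 0.64,\]

so the implied LRV variance is \(\sigma^2_j = \lambda_j^2 \phi + \theta_j = 0.36 + 0.64 = 1\), as required.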

Estimation in lavaan: Code

library(lavaan)
dat <- read.csv("cint_data.csv")

# Model syntax
mod1 <- 'depression =~ cint1 + cint2 + cint4 + cint11 + 
                       cint27 + cint28 + cint29 + cint30'

# Fit model
fit.delta <- cfa(mod1, 
                 data = dat, 
                 std.lv = TRUE,  # standardize latent variable
                 ordered = TRUE) # data are ordered
                 
# Print model summary
summary(fit.delta)

Estimation in lavaan: Output


Summary of Delta parameterization

  • The only parameters we estimate are factor loadings \(\lambda_j\) and the thresholds \(\tau_{jc}\)

  • The latent trait is standardized: \(\eta \sim N(0, 1)\)

  • The LRVs are standardized: \(X_j^* \sim N(0, 1)\)

    • Same as setting \(\Delta_j = 1/ \sigma_j = 1\)
  • The intercepts are fixed, \(\nu_j = 0\)

  • The residual variances are fixed, \(\theta_j = 1 - \lambda_j^2\)

    • In some MI models, can estimate \(\Delta_j\), but \(\theta\) is still fixed!

Theta parameterization

  • Instead of setting \(\sigma^2_j = 1\), set the residual variance \(\theta_j = 1\)

  • Implies variance of \(X_j^*\) is fixed to \(\sigma^2_j = \lambda_j^2 + 1\)

  • So the Delta parameter is fixed to

\[ \Delta_j = 1 / \sigma_j = 1 / \sqrt{\lambda_j^2 + 1}\]

  • Model interpretation is more complicated since \(\sigma^2_j \neq 1\)
  • See coding notes for example

Summary

  • Factor model for categorical data uses LRVs

    • Step 1. Represent categorical data using LRV
    • Step 2. Factor analyze LRVs
  • Convenient trick! But introduces many parameters that we can’t estimate

  • The only parameters we can estimate are the factor loadings and thresholds

  • In Delta, these are easy to interpret!

  • (In Theta, not easy to interpret)

Measurement Invariance

Recap of MI

  • We want our measurement model to be the same across groups

  • This will ensure that any group differences on the observed data \(\bf x\) are due only to difference on the target construct \(\eta\) (i.e., impact)

  • Important for ensuring unbiased comparisons between groups

Measurement model parameters

  • For groups \(g = 1, 2, ..., G\)

  • The factor loadings, \(\lambda_{jg}\)

  • The item thresholds, \(\tau_{jg} = [\tau_{j1g}, \tau_{j2g}, \dots, \tau_{j,C-1,g}]\)

  • We want to test if these are equal over groups:

\[\lambda_{j1} = \lambda_{j2} = \dots = \lambda_{jG} \]

\[\tau_{jc1} = \tau_{jc2} = \dots = \tau_{jcG} \]

Population model parameters

  • In a single group, we had to standardize \(\eta \sim N(0, 1)\) to estimate the model

  • In multiple groups, this approach is problematic

  • e.g., if we set the mean of the factor to be 0 in each group:

\[\kappa_1 = \kappa_2 = \dots = \kappa_G = 0\]

  • We are asserting that all groups have the same mean on the latent trait – this is not an “arbitrary” constraint on the model!

Population model parameters

  • MI allows us to estimate the population model parameters (see Muthén & Asparouhov, 2002; Millsap & Yun-Tein, 2004)

  • In fact, the goal of MI can be interpreted in terms of placing sufficient constraints on the model to estimate impact

    • More on this soon when we talk about “levels” of MI
  • Even with MI, still need to standardize \(\eta \sim N(0,1)\) in one group, called the reference group

Nuisance parameters

  • What about the LRV parameters?
    • \(X_j^* \sim N(\mu_j, \sigma^2_j)\)
    • The intercepts, \(\nu_j\)
    • The residual variances, \(\theta_j\)
  • These are technically part of the measurement model
  • With MI, we can estimate either the LRV variances \(\sigma^2_j\) (Delta) or the residual variances \(\theta_j\) (Theta)
    • Most software will do this by default …

Summary

  • The measurement parameters:
    • The factor loadings, \(\lambda_{jg}\)
    • The item thresholds, \(\tau_{jg} = [\tau_{j1g}, \tau_{j2g}, \dots, \tau_{j,C-1,g}]\)
  • The population parameters:
    • \(\eta \sim N(\kappa_g, \phi_g)\)
  • The nuisance (LRV) parameters:
    • \(X_j^* \sim N(\mu_j, \sigma^2_j)\); intercepts: \(\nu_j\); residual variances: \(\theta_j\)

Levels of Measurement Invariance

  • configural, weak, metric, scalar, strong, strict, …

There are lots of versions…

Table: Thissen, 2023

Summary of levels: Configural invariance

  • Measurement model: Same factor pattern over groups (which items go with which factors)
  • Population model: Not sufficient to estimate impact on any parameter
  • Not usually interpreted, but is a basis for testing other models

Summary of levels: Weak / metric invariance

  • Measurement model: All factor loadings are equal over groups
  • Population model: Sufficient to estimate impact on factor (co-) variances
  • Can serve as basis for multi-group structural equation modeling (without mean structure)

Summary of levels: Strong / scalar invariance

  • Measurement model: All factor loadings and thresholds are equal over groups
  • Population model: Sufficient to estimate impact on factor (co-) variances and means
  • Considered acceptable for comparing groups on observed test scores

Summary of levels: Strict invariance

  • Measurement model: All factor loadings, thresholds, and residual variances are equal over groups
  • Population model: Sufficient to estimate impact on factor (co-) variances and means
  • Ensures test scores are equally reliable in both groups
  • Note: some issues distinguishing strong and strict MI with categorical data (we will see this soon)

The configural model: Recap

  • Measurement model: Same factor pattern over groups (which items go with which factors)
  • Population model: Not sufficient to estimate impact on any parameter
  • Not usually interpreted, but is a basis for testing other models

Configural model: Code

# Model (same as above)
mod1 <- ' depression =~ cint1 + cint2 + cint4 + cint11 + 
                        cint27 + cint28 + cint29 + cint30'
# Fit model
fit.config <- cfa(mod1, 
                  data = dat, 
                  std.lv = TRUE,  
                  ordered = TRUE, 
                  group = "cfemale") # <--- new 
                  
# Print model summary
summary(fit.config)

Configural model: Output

Weak / metric invariance: Recap

  • Measurement model: All factor loadings are equal over groups
  • Population model: Sufficient to estimate impact on factor (co-) variances
  • This model is more exciting in multidimensional settings, when we are interested in the covariance matrix of factors, not just the variance of a single factor

Weak / metric invariance: Code

# Fit model
fit.weak <- cfa(mod1, 
                data = dat, 
                std.lv = TRUE,  
                ordered = TRUE, 
                group = "cfemale",
                group.equal = "loadings") # <--- new 
                  
# Print model summary
summary(fit.weak)

Weak / metric invariance: Output

Comparing the models

  • Nested CFA models can be compared using their chi-square statistics (e.g., Satorra & Bentler, 2001)

  • Two models are nested if one can be obtained from the other by setting some parameters to fixed values

  • Let “A” denote the larger model and “B” denote the smaller model

  • Define: \(\chi^2_\text{DIFF} = \chi^2_\text{B} - \chi^2_\text{A} \quad \text{and} \quad df_\text{DIFF} = df_\text{B} - df_\text{A}\)

  • Then \(\chi^2_\text{DIFF}\) has a central chi-square distribution with \(df_\text{DIFF}\) degrees of freedom when the constrained model is true
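The definition can be sketched with hypothetical chi-square statistics (for the standard, unscaled test; lavTestLRT() applies a robust correction to fitted lavaan models):

```r
# Unscaled chi-square difference test with hypothetical statistics
chisq_A <- 37.5; df_A <- 40   # model A: larger (less constrained)
chisq_B <- 49.6; df_B <- 47   # model B: smaller (constrained)

chisq_diff <- chisq_B - chisq_A
df_diff    <- df_B - df_A
p <- pchisq(chisq_diff, df = df_diff, lower.tail = FALSE)
c(chisq_diff = chisq_diff, df_diff = df_diff, p = round(p, 3))
```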

Comparing the models: Code

lavTestLRT(fit.config, fit.weak)

Scaled Chi-Squared Difference Test (method = "satorra.2000")

lavaan NOTE:
    The "Chisq" column contains standard test statistics, not the
    robust test that should be reported per model. A robust difference
    test is a function of two standard (not robust) statistics.
 
           Df AIC BIC  Chisq Chisq diff Df diff Pr(>Chisq)  
fit.config 40         37.534                                
fit.weak   47         57.126     12.091       7    0.09761 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Do not reject weak invariance using \(\alpha = .05\)
  • 8 factor loadings constrained, estimated 1 variance for the latent trait, so \(df = 8 - 1 = 7\)

Summary of example

  • Weak invariance with respect to gender was satisfied
    • Can compare variance of latent trait over groups
  • In the example, depression was slightly more variable for females
    • Males: Est = 1.000; SE = NA
    • Females: Est = 1.263; SE = 0.120
  • To test homogeneity of variance, see coding examples

Strong / scalar invariance: Recap

  • Measurement model: All factor loadings and thresholds are equal over groups

  • Population model: Sufficient to estimate impact on factor (co-) variances and means

  • Considered acceptable for comparing groups on observed test scores

  • Wrinkle with categorical data: with strong invariance, we can also estimate the variances of the LRVs

    • Most software will do this by default

Strong / scalar invariance: Code

# Fit model
fit.strong <- cfa(mod1, 
                  data = dat, 
                  std.lv = TRUE,  
                  ordered = TRUE, 
                  group = "cfemale",
                  group.equal = c("loadings", "thresholds")) # <--- new 
                  
# Print model summary
summary(fit.strong)

Strong / scalar invariance: Output

Comparing models

lavTestLRT(fit.config, fit.weak, fit.strong)

Scaled Chi-Squared Difference Test (method = "satorra.2000")

lavaan NOTE:
    The "Chisq" column contains standard test statistics, not the
    robust test that should be reported per model. A robust difference
    test is a function of two standard (not robust) statistics.
 
           Df AIC BIC   Chisq Chisq diff Df diff Pr(>Chisq)    
fit.config 40          37.534                                  
fit.weak   47          57.126     12.091       7    0.09761 .  
fit.strong 62         112.075     63.910      15    5.3e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Reject strong invariance using \(\alpha = .05\)
  • \(8 \times 3\) thresholds constrained, but estimated mean of the latent trait and 8 \(\Delta\) parameters, so \(df = 24 - 1 - 8 = 15\)

Summary of example

  • Strong invariance with respect to gender was not satisfied

  • In the example, depression was much higher on average for females

    • Males: Est = 0.000; SE = NA
    • Females: Est = .450; SE = 0.086
  • .45 SD difference between groups (SD \(= 1\) for males)

  • But, we do not know if this is due to measurement bias, impact, or both (because the model was rejected)

Next Steps

Summary

  • We have seen how to test MI using factor analysis for categorical data

  • In our example, we found that the CINT assessment satisfied metric but not scalar invariance

    • Implication – comparisons over gender may reflect measurement bias, impact, or both
  • Next, we consider how to find items that exhibit DIF

    • Removing these items from the assessment will ensure mean comparisons on the CINT are unbiased and fair with respect to gender!

What we have done so far

  • MI and DIF in general
  • Factor analysis with categorical data
  • Testing MI using factor analysis
    • Configural, weak, strong
    • See Appendix and coding example for strict invariance
  • Illustrated methods using an example

What we will do next

  • Switch perspectives to IRT
  • IRT with binary and categorical data
  • Testing DIF using IRT
  • Illustrate methods using an example

References

Bauer, D. J. (2017). A more general model for testing measurement invariance and differential item functioning. Psychological Methods, 22(3), 507–526.

Curley, W. E., & Schmitt, A. P. (1993). Revising SAT®-Verbal items to eliminate differential item functioning. ETS Research Report Series, 1993(2), i–18.

Holland, P. W. & Rosenbaum. P. R. (1986). Conditional Association and Unidimensionality in Monotone Latent Variable Models. The Annals of Statistics, 14(4), 1523–1543.

Millsap, R. E., & Yun-Tein, J. (2004). Assessing Factorial Invariance in Ordered-Categorical Measures. Multivariate Behavioral Research, 39(3), 479–515.

Muthén, B., & Asparouhov, T. (2002). Latent variable analysis with categorical outcomes: Multiple-group and growth modeling in Mplus.

Satorra, A., & Bentler, P. (2001). A scaled difference chi-square test statistic for moment structure analysis. Psychometrika, 66, 507–514.

Wu, H., & Estabrook, R. (2016). Identification of Confirmatory Factor Analysis Models of Different Levels of Invariance for Ordered Categorical Outcomes. Psychometrika, 81(4), 1014–1045.

Appendix

Strict invariance: Recap

  • Measurement model: All factor loadings, thresholds, and residual variances are equal over groups
  • Population model: Sufficient to estimate impact on factor (co-) variances and means
  • Ensures test scores are equally reliable in both groups

Strict vs strong invariance

  • In factor analysis for continuous data, strict invariance is rarely tested

    • It is unnecessary for estimating impact or comparing groups on observed scores
  • With categorical data: Should variances (“Deltas”) of LRVs be treated as real parameters?

    • I don’t think so; see supplementary material for other opinions
  • If we want to ignore the LRV variances, then we should use strict rather than strong MI

  • Easier to do with the Theta parameterization; requires some new code in lavaan …

  • In our example, it won’t make a difference since we already rejected strong invariance

Strict invariance: Code

# Model syntax to constrain Delta = 1 in both groups
mod.strict <- 
  'depression =~ cint1 + cint2 + cint4 + cint11 + 
                 cint27 + cint28 + cint29 + cint30
                 
  cint1 ~*~ c(1, 1)*cint1
  cint2 ~*~ c(1, 1)*cint2
  cint4 ~*~ c(1, 1)*cint4
  cint11 ~*~ c(1, 1)*cint11
  cint27 ~*~ c(1, 1)*cint27
  cint28 ~*~ c(1, 1)*cint28
  cint29 ~*~ c(1, 1)*cint29
  cint30 ~*~ c(1, 1)*cint30'

fit.strict <- cfa(mod.strict, # <-- new
                  data = dat, 
                  std.lv = TRUE,  
                  ordered = TRUE, 
                  group = "cfemale",
                  group.equal = c("loadings", "thresholds")) 

summary(fit.strict)

Strict invariance: Code

lavTestLRT(fit.config, fit.weak, fit.strict)

Scaled Chi-Squared Difference Test (method = "satorra.2000")

lavaan NOTE:
    The "Chisq" column contains standard test statistics, not the
    robust test that should be reported per model. A robust difference
    test is a function of two standard (not robust) statistics.
 
           Df AIC BIC   Chisq Chisq diff Df diff Pr(>Chisq)    
fit.config 40          37.534                                  
fit.weak   47          57.126     12.091       7    0.09761 .  
fit.strict 70         121.781     73.542      23  3.411e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Note: \(8 \times 3\) thresholds constrained, but estimated mean of the latent trait so \(df = 24 - 1 = 23\)

  • I think this is the correct \(df\) for this comparison