Estimate bias and interaction screening terms
Usage
estimate_bias(
fit,
diagnostics,
facet_a = NULL,
facet_b = NULL,
interaction_facets = NULL,
max_abs = 10,
omit_extreme = TRUE,
max_iter = 4,
tol = 0.001
)
Arguments
- fit
Output from fit_mfrm().
- diagnostics
Output from diagnose_mfrm().
- facet_a
First facet name. Provide together with facet_b for the classic pairwise 2-way interaction. Ignored when interaction_facets is supplied.
- facet_b
Second facet name. See facet_a.
- interaction_facets
Character vector of two or more facets to model as one interaction effect. When supplied, this takes precedence over facet_a/facet_b. Use this form (rather than facet_a/facet_b) whenever you want 3+ way interactions, since facet_a/facet_b is restricted to the pairwise case.
- max_abs
Bound for absolute bias size.
- omit_extreme
Omit extreme-only elements.
- max_iter
Iteration cap.
- tol
Convergence tolerance.
Value
An object of class mfrm_bias with:
- table: interaction rows with effect size, SE, screening t/p metadata, reporting-use flags, fit columns, and bounded-GPCM profile-likelihood columns when available
- summary: compact summary statistics
- chi_sq: fixed-effect chi-square style screening summary
- facet_a, facet_b: first two analyzed facet names (legacy compatibility)
- interaction_facets, interaction_order, interaction_mode: full interaction metadata
- iteration: iteration history/metadata
- orientation_review: facet-orientation sign-consistency review table
- mixed_sign: logical flag indicating whether bias-size signs flip across facets in a way that complicates direction interpretation
- direction_note: one-line interpretive note describing the dominant bias direction (empty when not applicable)
- recommended_action: one-line recommended-action label routing the user to the appropriate follow-up helper
- inference_tier: summary label indicating that the bias rows are intended for screening and follow-up review in this release
- optimization_failures: per-cell record of any inner-loop optimizer failures encountered while estimating the bias parameters; empty when every cell converged cleanly
Details
Bias (interaction) in MFRM refers to a systematic departure from the additive model: a specific rater-criterion (or higher-order) combination produces scores that are consistently higher or lower than predicted by the main effects alone. For example, Rater A might be unexpectedly harsh on Criterion 2 despite being lenient overall.
Mathematically, the bias term \(b_{jc}\) for rater \(j\) on criterion \(c\) modifies the linear predictor:
$$\eta_{njc} = \theta_n - \delta_j - \beta_c - b_{jc}$$
For RSM / PCM, the function estimates \(b_{jc}\) from the residuals
of the fitted additive model using an iterative recalibration screen aligned
with the many-facet bias literature (Myford & Wolfe, 2003, 2004):
$$b_{jc} = \frac{\sum_n (X_{njc} - E_{njc})} {\sum_n \mathrm{Var}_{njc}}$$
Each iteration updates expected scores using the current bias estimates,
then re-computes the bias. Convergence is reached when the maximum absolute
change in bias estimates falls below tol. For bounded GPCM, the same
additive-bias idea is evaluated with the slope-aware GPCM kernel and
conditional profile-likelihood follow-up columns; those quantities remain
screening evidence because theta, facet, step, and slope estimates are held
fixed.
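The recalibration loop described above can be sketched in base R for a single interaction cell. This is a minimal illustration and not the package's implementation: it assumes a dichotomous Rasch simplification (so the expected score is a probability and the variance is p(1 - p)), it treats the person and facet measures as fixed, and it uses a score-side orientation in which a positive shift means higher scores than the additive model predicts. How that sign maps onto the reported Bias Size depends on facet orientation, which the sketch does not model. The function name estimate_cell_shift() and the toy data are hypothetical.

```r
# Toy data for ONE rater-by-criterion cell (dichotomous simplification,
# an assumption; the documented function covers RSM / PCM / bounded GPCM).
set.seed(1)
n_person <- 200
theta    <- rnorm(n_person)        # person measures, held fixed
base_eta <- theta - 0.3 - 0.1      # additive predictor: theta - delta_j - beta_c
x <- rbinom(n_person, 1, plogis(base_eta + 0.8))  # simulated with a 0.8 logit shift

# Hypothetical helper: residual-based update repeated until the change
# drops below `tol` or `max_iter` is reached, bounded by `max_abs`.
estimate_cell_shift <- function(x, base_eta, max_abs = 10,
                                max_iter = 4, tol = 0.001) {
  shift <- 0
  max_change <- numeric(0)
  for (iter in seq_len(max_iter)) {
    p    <- plogis(base_eta + shift)            # expected scores given current shift
    step <- sum(x - p) / sum(p * (1 - p))       # sum of residuals / sum of variances
    new_shift  <- max(-max_abs, min(max_abs, shift + step))
    max_change <- c(max_change, abs(new_shift - shift))
    shift <- new_shift
    if (max_change[length(max_change)] < tol) break
  }
  p <- plogis(base_eta + shift)
  list(
    shift      = shift,
    se         = 1 / sqrt(sum(p * (1 - p))),    # model-based standard error
    max_change = max_change,                    # per-iteration change history
    converged  = max_change[length(max_change)] < tol
  )
}

res <- estimate_cell_shift(x, base_eta, max_iter = 20)
```

The first pass of the loop reproduces the ratio of summed residuals to summed variances shown in the formula; later passes refine it after the expected scores have been updated.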
For two-way mode, use facet_a and facet_b (or interaction_facets with length 2).
For higher-order mode, provide interaction_facets with length >= 3.
What this screening means
estimate_bias() summarizes interaction departures from the additive MFRM.
It is best read as a targeted screening tool for potentially noteworthy
cells or facet combinations that may merit substantive review.
What this screening does not justify
- t and Prob. are screening metrics, not formal inferential quantities.
- A flagged interaction cell is not, by itself, proof of rater bias or construct-irrelevant variance.
- Non-flagged cells should not be over-read as evidence that interaction effects are absent.
Interpreting output
Use summary for global magnitude, then inspect table for cell-level
interaction effects.
Prioritize rows with:
- larger |Bias Size| (effect on logit scale; \(> 0.5\) logits is typically noteworthy, \(> 1.0\) is large)
- larger |t| among the screening metrics (\(|t| \ge 2\) suggests a screen-positive interaction cell)
- smaller Prob. among the screening metrics
A positive Obs-Exp Average means the cell produced higher scores
than the additive model predicts (unexpected leniency); negative
means unexpected harshness.
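As a rough illustration of this prioritization, the sketch below filters a small hand-made table. The data frame is a toy stand-in, not output from estimate_bias(); only the column names Bias Size, t, and Obs-Exp Average are taken from this page, and the values, the facet labels, and the added Direction column are hypothetical.

```r
# Toy stand-in for bias$table; the values are invented for illustration.
bias_table <- data.frame(
  Rater             = c("R1", "R1", "R2", "R2"),
  Criterion         = c("C1", "C2", "C1", "C2"),
  `Bias Size`       = c(0.12, -0.85, 0.55, -0.05),
  t                 = c(0.60, -2.90, 1.40, -0.20),
  `Obs-Exp Average` = c(0.03, -0.21, 0.14, -0.01),
  check.names = FALSE   # keep column names with spaces and hyphens as-is
)

# Keep rows that pass both screening thresholds named above.
keep    <- abs(bias_table[["Bias Size"]]) > 0.5 & abs(bias_table[["t"]]) >= 2
flagged <- bias_table[keep, ]

# Label the direction from the sign of the observed-minus-expected average.
flagged$Direction <- ifelse(
  flagged[["Obs-Exp Average"]] > 0,
  "unexpected leniency",
  "unexpected harshness"
)
```

Rows that pass the filter are candidates for substantive review, not confirmed bias.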
iteration helps verify whether iterative recalibration stabilized.
If the maximum change on the final iteration is still above tol,
consider increasing max_iter.
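A stabilization check along these lines might look like the sketch below. The layout of the iteration element is not documented on this page, so the data frame and its MaxChange column are hypothetical placeholders; adapt the column name to whatever the returned object actually contains.

```r
# Hypothetical iteration history: one row per recalibration pass.
iteration <- data.frame(
  Iteration = 1:4,
  MaxChange = c(0.41, 0.06, 0.009, 0.002)  # invented values
)
tol <- 0.001

# Compare the change on the final pass with the convergence tolerance.
final_change <- iteration$MaxChange[nrow(iteration)]
stabilized   <- final_change < tol

if (!stabilized) {
  message("Final change ", final_change, " is above tol = ", tol,
          "; consider a larger max_iter.")
}
```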
Typical workflow
1. Fit and diagnose model.
2. Run estimate_bias(...) for target interaction facets.
3. Review summary(bias) and bias$table.
4. Visualize/report via plot_bias_interaction() and build_fixed_reports().
Interpreting key output columns
In bias$table, the most-used columns are:
- Bias Size: estimated interaction effect \(b_{jc}\) (logit scale)
- t and Prob.: screening metrics, not formal inferential quantities
- Obs-Exp Average: direction and practical size of observed-vs-expected gap on the raw-score metric
- for bounded GPCM, LR ChiSq, LR Prob., and Profile CI Lower / Profile CI Upper: conditional profile-likelihood checks for a single additive bias shift, holding the fitted person, facet, step, and slope estimates fixed
The chi_sq element provides a fixed-effect heterogeneity screen across all
interaction cells.
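One common form of such a screen sums the squared standardized bias estimates and refers the total to a chi-square distribution with one degree of freedom per cell. The sketch below assumes that form; the exact statistic and degrees of freedom behind chi_sq may differ, and the bias sizes and standard errors are invented.

```r
# Invented cell-level estimates and standard errors (logit scale).
bias_size <- c(0.12, -0.85, 0.55, -0.05)
se        <- c(0.20,  0.29, 0.39,  0.25)

# Standardize each cell, then pool under the "all interactions = 0" hypothesis.
t_stat  <- bias_size / se
chi_sq  <- sum(t_stat^2)
df      <- length(t_stat)          # assumption: one df per interaction cell
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)
```

A small p_value here says only that the cells are, taken together, more variable than the additive model expects; it does not identify which cells drive the result.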
Recommended next step
Use plot_bias_interaction() to inspect the flagged cells visually, then
integrate the result with DFF, linking, or substantive scoring review before
making formal claims about fairness or invariance.
References
Linacre, J. M. (1989). Many-Facet Rasch Measurement. MESA Press. (FACETS Table 13 corresponds to the bias / interaction estimation that this helper implements.)
Eckes, T. (2005). Examining rater effects in TestDaF writing and speaking performance assessments: A many-facet Rasch analysis. Language Assessment Quarterly, 2(3), 197-221.
Eckes, T. (2015). Introduction to many-facet Rasch measurement: Analyzing and evaluating rater-mediated assessments (2nd ed.). Peter Lang.
Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4(4), 386-422.
Myford, C. M., & Wolfe, E. W. (2004). Detecting and measuring rater effects using many-facet Rasch measurement: Part II. Journal of Applied Measurement, 5(2), 189-227.
Examples
if (FALSE) { # interactive()
toy <- load_mfrmr_data("example_bias")
fit <- fit_mfrm(toy, "Person", c("Rater", "Criterion"), "Score", method = "JML", maxit = 30)
diag <- diagnose_mfrm(fit, residual_pca = "none")
bias <- estimate_bias(fit, diag, facet_a = "Rater", facet_b = "Criterion", max_iter = 2)
s_bias <- summary(bias)
s_bias$overview
# Look for: `MaxAbsBias` below about 0.5 logits and `Significant = 0`
# mean no cell exceeded the screen. The `BonferroniSignificant` /
# `HolmSignificant` columns count cells that remain screen-positive
# after multiple-testing correction; a nonzero raw count with both
# corrected counts at 0 points to weaker candidates for follow-up.
s_bias$top_rows
# Look for: rows with `|t|` > 2 and |Bias Size| > 0.5 logits warrant
# review (large effect AND screen-positive). Rows that trigger only
# one of the two often reflect cell size (very small or very large cells).
p_bias <- plot_bias_interaction(bias, draw = FALSE)
p_bias$data$plot
}
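The multiple-testing counts mentioned in the example comments can be illustrated with base R's p.adjust(). This is a sketch of the general idea only: the p-values are invented, the 0.05 cut-off is an assumption, and the package may compute its BonferroniSignificant and HolmSignificant columns differently.

```r
# Invented screening p-values, one per interaction cell.
p_raw <- c(0.004, 0.030, 0.160, 0.840)

# Adjust for the number of cells screened.
p_bonf <- p.adjust(p_raw, method = "bonferroni")
p_holm <- p.adjust(p_raw, method = "holm")

# Count cells below the (assumed) 0.05 threshold before and after correction.
counts <- c(
  raw        = sum(p_raw  < 0.05),
  bonferroni = sum(p_bonf < 0.05),
  holm       = sum(p_holm < 0.05)
)
```

Comparing the raw and corrected counts shows how many screen-positive cells depend on ignoring the number of cells tested.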