Evaluate MFRM design conditions by repeated simulation

Usage

evaluate_mfrm_design(
  n_person = c(30, 50, 100),
  n_rater = c(3, 5),
  n_criterion = c(3, 5),
  raters_per_person = n_rater,
  reps = 10,
  score_levels = 4,
  theta_sd = 1,
  rater_sd = 0.35,
  criterion_sd = 0.25,
  noise_sd = 0,
  step_span = 1.4,
  fit_method = c("JML", "MML"),
  model = c("RSM", "PCM"),
  step_facet = NULL,
  maxit = 25,
  quad_points = 7,
  residual_pca = c("none", "overall", "facet", "both"),
  sim_spec = NULL,
  seed = NULL
)

Arguments

n_person

Vector of person counts to evaluate.

n_rater

Vector of rater counts to evaluate.

n_criterion

Vector of criterion counts to evaluate.

raters_per_person

Vector giving the number of raters assigned to each person; the default (n_rater) corresponds to a fully crossed design.

reps

Number of replications per design condition.

score_levels

Number of ordered score categories.

theta_sd

Standard deviation of simulated person measures.

rater_sd

Standard deviation of simulated rater severities.

criterion_sd

Standard deviation of simulated criterion difficulties.

noise_sd

Optional observation-level noise added to the linear predictor.

step_span

Spread of step thresholds on the logit scale.

fit_method

Estimation method passed to fit_mfrm().

model

Measurement model passed to fit_mfrm().

step_facet

Step facet passed to fit_mfrm() when model = "PCM". When left NULL, the function inherits the generator step facet from sim_spec when available and otherwise defaults to "Criterion".

maxit

Maximum iterations passed to fit_mfrm().

quad_points

Quadrature points for fit_method = "MML".

residual_pca

Residual PCA mode passed to diagnose_mfrm().

sim_spec

Optional output from build_mfrm_sim_spec() or extract_mfrm_sim_spec() used as the base data-generating mechanism. When supplied, the design grid still varies n_person, n_rater, n_criterion, and raters_per_person, but latent-spread assumptions, thresholds, and other generator settings come from sim_spec. If sim_spec contains step-facet-specific thresholds, the design grid may not vary the number of levels for that step facet away from the specification.

seed

Optional seed for reproducible replications.

Value

An object of class mfrm_design_evaluation with components:

  • design_grid: evaluated design conditions

  • results: facet-level replicate results

  • rep_overview: run-level status and timing

  • settings: simulation settings

  • ademp: simulation-study metadata (aims, DGM, estimands, methods, performance measures)

Details

This helper runs a compact Monte Carlo design study for common rater-by-item many-facet settings.

For each design condition, the function:

  1. generates synthetic data with simulate_mfrm_data()

  2. fits the requested MFRM with fit_mfrm()

  3. computes diagnostics with diagnose_mfrm()

  4. stores recovery and precision summaries by facet

The result is intended for planning questions such as:

  • how many raters are needed for stable rater separation?

  • how does raters_per_person affect severity recovery?

  • when do category counts become too sparse for comfortable interpretation?

This is a parametric simulation study. It does not take one observed design (for example, 4 raters x 30 persons x 3 criteria) and analytically extrapolate what would happen under a different design (for example, 2 raters x 40 persons x 5 criteria). Instead, you specify a design grid and data-generating assumptions (latent spread, facet spread, thresholds, noise, and scoring structure), and the function repeatedly generates synthetic data under those assumptions.

When you want the simulated conditions to resemble an existing study, use substantive knowledge or estimates from that study to choose theta_sd, rater_sd, criterion_sd, score_levels, and related settings before running the design evaluation.

When sim_spec is supplied, the function uses it as the explicit data-generating mechanism. This is the recommended route when you want a design study to stay close to a previously fitted run while still varying the candidate sample sizes or rater-assignment counts.
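A minimal sketch of that workflow, extracting a specification from a previously fitted run and varying only the sample-size columns of the grid. The `ratings` data object and the exact signatures of fit_mfrm() and extract_mfrm_sim_spec() are assumptions for illustration, not the documented interfaces:

```r
# Hypothetical sketch: `ratings` and the argument lists below are
# illustrative assumptions, not the package's documented signatures.
fit  <- fit_mfrm(ratings, model = "PCM", step_facet = "Criterion")
spec <- extract_mfrm_sim_spec(fit)

design_eval <- evaluate_mfrm_design(
  n_person          = c(50, 100),   # varied by the design grid
  raters_per_person = c(2, 3),      # varied by the design grid
  sim_spec          = spec,         # latent spread, thresholds, etc. come from here
  reps              = 20,
  seed              = 1
)
```

Because the generator and the fitted model then share the same step facet, recovery metrics such as SeverityRMSE remain available in the output.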

Recovery metrics are reported only when the generator and fitted model target the same facet-parameter contract. In practice this means the same model, and for PCM, the same step_facet. When these do not align, recovery fields are set to NA and the output records the reason.

Reported metrics

Facet-level simulation results include:

  • Separation (\(G = \mathrm{SD_{adj}} / \mathrm{RMSE}\)): how many statistically distinct strata the facet resolves.

  • Reliability (\(G^2 / (1 + G^2)\)): analogous to Cronbach's \(\alpha\) for the reproducibility of element ordering.

  • Strata (\((4G + 1) / 3\)): number of distinguishable groups.

  • Mean Infit and Outfit: average fit mean-squares across elements.

  • MisfitRate: share of elements with \(|\mathrm{ZSTD}| > 2\).

  • SeverityRMSE: root-mean-square error of recovered parameters vs the known truth after facet-wise mean alignment, so that the usual Rasch/MFRM location indeterminacy does not inflate recovery error. This quantity is reported only when the generator and fitted model target the same facet-parameter contract.

  • SeverityBias: mean signed recovery error after the same alignment; values near zero are expected. This is likewise omitted when the generator/fitted-model contract does not align.
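The three separation quantities above follow mechanically from the adjusted SD and RMSE, as this base-R sketch shows; `sd_adj` and `rmse` are illustrative values, not output columns of the evaluation object:

```r
# Illustrative inputs: adjusted SD of element measures and the RMSE of
# their standard errors, both on the logit scale.
sd_adj <- 1.2
rmse   <- 0.4

G           <- sd_adj / rmse       # separation index
reliability <- G^2 / (1 + G^2)     # reproducibility of element ordering
strata      <- (4 * G + 1) / 3     # statistically distinct levels

c(G = G, Reliability = reliability, Strata = strata)
```

Note that reliability is a monotone transform of G, so the two metrics always agree on the ranking of designs; strata is often the easier quantity to communicate to non-technical stakeholders.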

Interpreting output

Start with summary(x)$design_summary, then plot one focal metric at a time (for example rater Separation or criterion SeverityRMSE).

Higher separation/reliability is generally better, whereas lower SeverityRMSE, MeanMisfitRate, and MeanElapsedSec are preferable.

When choosing among designs, look for the point where increasing n_person or raters_per_person yields diminishing returns in separation and RMSE—this identifies the cost-effective design frontier. ConvergedRuns / reps should be near 1.0; low convergence rates indicate the design is too small for the chosen estimation method.
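One way to locate that frontier is to difference a focal metric along the design grid; the toy data frame below mimics the shape of the design_summary shown in the Examples (column names are taken from that output, the values are made up):

```r
# Toy values shaped like summary(sim_eval)$design_summary; a real analysis
# would subset that table to one facet instead.
ds <- data.frame(
  Facet            = rep("Rater", 3),
  n_person         = c(30, 50, 100),
  MeanSeparation   = c(2.4, 2.9, 3.1),
  MeanSeverityRMSE = c(0.30, 0.20, 0.17)
)

# Marginal gain per step of the grid: once these flatten, adding persons
# buys little extra separation or precision.
gain_sep  <- diff(ds$MeanSeparation)
drop_rmse <- -diff(ds$MeanSeverityRMSE)
data.frame(step = c("30->50", "50->100"), gain_sep, drop_rmse)
```

Here the 50-to-100 step yields less than half the improvement of the 30-to-50 step on both metrics, suggesting 50 persons sits near the cost-effective frontier for this (fictional) facet.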

References

The simulation logic follows the general Monte Carlo / operating-characteristic framework described by Morris, White, and Crowther (2019) and the ADEMP-oriented planning/reporting guidance summarized for psychology by Siepe et al. (2024). In mfrmr, evaluate_mfrm_design() is a practical many-facet design-planning wrapper rather than a direct reproduction of one published simulation study.

  • Morris, T. P., White, I. R., & Crowther, M. J. (2019). Using simulation studies to evaluate statistical methods. Statistics in Medicine, 38(11), 2074-2102.

  • Siepe, B. S., Bartoš, F., Morris, T. P., Boulesteix, A.-L., Heck, D. W., & Pawel, S. (2024). Simulation studies for methodological research in psychology: A standardized template for planning, preregistration, and reporting. Psychological Methods.

Examples

sim_eval <- evaluate_mfrm_design(
  n_person = c(30, 50),
  n_rater = 4,
  n_criterion = 4,
  raters_per_person = 2,
  reps = 1,
  maxit = 15,
  seed = 123
)
#> Warning: Optimizer did not fully converge (code = 1). Consider increasing maxit (current: 15) or relaxing reltol (current: 1e-06).
#> Warning: Optimizer did not fully converge (code = 1). Consider increasing maxit (current: 15) or relaxing reltol (current: 1e-06).
s_eval <- summary(sim_eval)
s_eval$design_summary[, c("Facet", "n_person", "MeanSeparation", "MeanSeverityRMSE")]
#> # A tibble: 6 × 4
#>   Facet     n_person MeanSeparation MeanSeverityRMSE
#>   <chr>        <dbl>          <dbl>            <dbl>
#> 1 Criterion       30           2.34            0.367
#> 2 Criterion       50           2.58            0.189
#> 3 Person          30           2.35            0.864
#> 4 Person          50           2.60            0.627
#> 5 Rater           30           2.99            0.195
#> 6 Rater           50           2.50            0.143
p_eval <- plot(sim_eval, facet = "Rater", metric = "separation", x_var = "n_person", draw = FALSE)
names(p_eval)
#> [1] "plot"       "facet"      "metric"     "metric_col" "x_var"     
#> [6] "group_var"  "data"