Test theoretical model of student engagement using SEM with 400 participants. Help me: 1) Specify measurement model (CFA), 2) Assess model fit (CFI, TLI, RMSEA, SRMR), 3) Test structural model with paths, 4) Evaluate direct and indirect effects, 5) Report results with path diagram, 6) Use lavaan in R or Mplus
Certainly! Conducting a structural equation modeling (SEM) analysis requires methodical steps. I’ll guide you through this process using R with the lavaan package.
1) Specify Measurement Model (CFA)
Start by specifying your measurement model, which defines the relationships between observed variables and their latent constructs (factors).
Assuming you have the following latent constructs related to student engagement:
- Cognitive Engagement (C1, C2, C3)
- Emotional Engagement (E1, E2, E3)
- Behavioral Engagement (B1, B2, B3)
You can specify your confirmatory factor analysis (CFA) model as follows:
library(lavaan)
# Define the measurement model
measurement_model <- '
# Latent variables
Cognitive =~ C1 + C2 + C3
Emotional =~ E1 + E2 + E3
Behavioral =~ B1 + B2 + B3
'
# Fit the model
fit_measurement <- cfa(measurement_model, data = your_data_frame_here)
summary(fit_measurement, standardized = TRUE)
2) Assess Model Fit
After fitting the measurement model, evaluate model fit using the following indices:
- CFI (Comparative Fit Index)
- TLI (Tucker-Lewis Index)
- RMSEA (Root Mean Square Error of Approximation)
- SRMR (Standardized Root Mean Square Residual)
You can extract these fit indices from your fitted model:
# Assess model fit
fit_indices <- fitMeasures(fit_measurement, c("cfi", "tli", "rmsea", "srmr"))
print(fit_indices)
3) Test Structural Model with Paths
Once the measurement model is validated, specify the structural model, which includes the relationships among latent constructs.
Assuming you hypothesize that Emotional Engagement influences Cognitive Engagement, which in turn influences Behavioral Engagement:
# Define structural model
structural_model <- '
# Measurement model as in previous step
Cognitive =~ C1 + C2 + C3
Emotional =~ E1 + E2 + E3
Behavioral =~ B1 + B2 + B3
# Structural paths
Cognitive ~ Emotional
Behavioral ~ Cognitive
'
# Fit the structural model
fit_structural <- sem(structural_model, data = your_data_frame_here)
summary(fit_structural, standardized = TRUE)
4) Evaluate Direct and Indirect Effects
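To evaluate direct and indirect effects, label the structural paths and define the indirect effect with lavaan's := operator. parameterEstimates() then reports the direct paths and the defined indirect effect, with a delta-method standard error by default. Because this model has no direct Emotional -> Behavioral path, the indirect effect is also the total effect:
# lavaan has no indirect() function; label the paths and define the product with :=
effects_model <- paste(measurement_model, '
Cognitive ~ a*Emotional
Behavioral ~ b*Cognitive
indirect := a*b # Emotional -> Cognitive -> Behavioral
')
fit_effects <- sem(effects_model, data = your_data_frame_here)
parameterEstimates(fit_effects, standardized = TRUE)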
5) Report Results with Path Diagram
You can create a path diagram that visually summarizes your model. The semPlot package in R can be used for this purpose:
library(semPlot)
# Create path diagram with standardized estimates as edge labels
semPaths(fit_structural, whatLabels = "stand", edge.label.cex = 1, sizeMan = 5, sizeLat = 7, layout = "tree")
6) Use lavaan in R or Mplus
The guidance above uses lavaan in R. If you prefer Mplus, here’s a similar approach:
TITLE: Student Engagement SEM
DATA: FILE = "yourdata.dat";
VARIABLE:
NAMES = C1 C2 C3 E1 E2 E3 B1 B2 B3;
USEVARIABLES = C1 C2 C3 E1 E2 E3 B1 B2 B3;
MODEL:
Cognitive BY C1 C2 C3;
Emotional BY E1 E2 E3;
Behavioral BY B1 B2 B3;
Cognitive ON Emotional;
Behavioral ON Cognitive;
OUTPUT: SAMPSTAT STANDARDIZED CINTERVAL;
Conclusion
This process systematically guides you through specifying a measurement model, assessing its fit, testing a structural model, and reporting the findings. Modify the latent variables according to your theoretical framework and ensure your data is preprocessed and appropriately scaled for analysis. Also, replace your_data_frame_here and file names with your actual dataset.
This is an excellent plan to test your theoretical model! Structural Equation Modeling (SEM) is perfectly suited for this, allowing you to simultaneously test both your measurement model (how latent variables are measured by observed items) and your structural model (the hypothesized relationships between these latent variables).
Let's break down each step. I'll provide examples using lavaan in R, as it's a powerful and open-source option. I'll also briefly mention Mplus syntax for comparison where appropriate.
First, let's conceptualize a sample theoretical model of student engagement to make the code concrete.
Example Theoretical Model:
Let's assume your theoretical model posits:
- Exogenous Latent Variables: PerceivedInstructorSupport (PIS), SelfEfficacy (SE)
- Endogenous Latent Variable (Mediator): AcademicMotivation (AM)
- Endogenous Latent Variable (Outcome): StudentEngagement (SEng)
Hypothesized Paths:
- PIS -> AM
- SE -> AM
- AM -> SEng
- PIS -> SEng (direct effect)
- SE -> SEng (direct effect)
This structure allows us to evaluate direct and indirect effects.
Prerequisites:
- Install lavaan package in R: install.packages("lavaan")
- Load lavaan: library(lavaan)
- Data: You'll need your dataset, ideally in a data.frame format, where each column represents an observed item. For this example, I'll simulate some data.
Data Simulation (for demonstration purposes)
Let's assume each latent variable is measured by 3 observed indicators.
# Load lavaan
library(lavaan)
# Set seed for reproducibility
set.seed(123)
# Number of participants
n_participants <- 400
# Simulate latent variables (these are *conceptual* - we don't directly observe them)
# We'll build the observed items based on these
PIS_latent <- rnorm(n_participants, mean = 5, sd = 1.5)
SE_latent <- rnorm(n_participants, mean = 4, sd = 1.2)
AM_latent <- 0.6 * PIS_latent + 0.4 * SE_latent + rnorm(n_participants, mean = 0, sd = 1)
SEng_latent <- 0.7 * AM_latent + 0.3 * PIS_latent + 0.2 * SE_latent + rnorm(n_participants, mean = 0, sd = 1)
# Simulate observed indicators with some measurement error
# Perceived Instructor Support (PIS)
pis1 <- 0.7 * PIS_latent + rnorm(n_participants, mean = 0, sd = 0.5)
pis2 <- 0.8 * PIS_latent + rnorm(n_participants, mean = 0, sd = 0.6)
pis3 <- 0.6 * PIS_latent + rnorm(n_participants, mean = 0, sd = 0.4)
# Self-Efficacy (SE)
se1 <- 0.75 * SE_latent + rnorm(n_participants, mean = 0, sd = 0.5)
se2 <- 0.7 * SE_latent + rnorm(n_participants, mean = 0, sd = 0.6)
se3 <- 0.8 * SE_latent + rnorm(n_participants, mean = 0, sd = 0.5)
# Academic Motivation (AM)
am1 <- 0.8 * AM_latent + rnorm(n_participants, mean = 0, sd = 0.5)
am2 <- 0.7 * AM_latent + rnorm(n_participants, mean = 0, sd = 0.6)
am3 <- 0.75 * AM_latent + rnorm(n_participants, mean = 0, sd = 0.5)
# Student Engagement (SEng)
seng1 <- 0.85 * SEng_latent + rnorm(n_participants, mean = 0, sd = 0.4)
seng2 <- 0.8 * SEng_latent + rnorm(n_participants, mean = 0, sd = 0.5)
seng3 <- 0.7 * SEng_latent + rnorm(n_participants, mean = 0, sd = 0.6)
# Combine into a data frame
sim_data <- data.frame(pis1, pis2, pis3, se1, se2, se3, am1, am2, am3, seng1, seng2, seng3)
# Display a quick summary
summary(sim_data)
head(sim_data)
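As a sanity check on the simulation above, the population indirect and total effects can be worked out by hand from the coefficients used to generate AM_latent and SEng_latent. The sketch below only copies those coefficients; the variable names are illustrative. These values are in the raw metric of the simulated latent variables, so estimates from a model fitted with std.lv = TRUE will be on a different scale and should be compared in direction and relative size rather than digit for digit.

```r
# Population path coefficients copied from the simulation code
a_pis <- 0.6   # PIS -> AM
a_se  <- 0.4   # SE  -> AM
b_am  <- 0.7   # AM  -> SEng
c_pis <- 0.3   # PIS -> SEng (direct)
c_se  <- 0.2   # SE  -> SEng (direct)

# Indirect effect = product of the two paths through the mediator
direct   <- c(PIS = c_pis, SE = c_se)
indirect <- c(PIS = a_pis * b_am, SE = a_se * b_am)

# Total effect = direct + indirect
total <- direct + indirect

effects <- rbind(direct = direct, indirect = indirect, total = total)
print(effects)
```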
1. Specify Measurement Model (Confirmatory Factor Analysis - CFA)
Before testing the structural relationships, it's crucial to confirm that your observed items reliably measure your latent constructs. This is done via Confirmatory Factor Analysis (CFA).
lavaan Syntax:
- latent_variable =~ item1 + item2 + item3 defines a latent variable and its indicators.
- By default, lavaan fixes the first loading of each factor to 1 (marker variable approach). Setting std.lv = TRUE instead fixes the variance of each latent variable to 1 (standardized approach), which is often preferred for interpretability.
# Define the measurement model
measurement_model <- '
# Latent variables defined by their indicators
PIS =~ pis1 + pis2 + pis3
SE =~ se1 + se2 + se3
AM =~ am1 + am2 + am3
SEng =~ seng1 + seng2 + seng3
# Optional: Allow error terms of related items to covary if theoretically justified.
# For example, if pis1 and pis2 are similarly worded, their errors might correlate.
# pis1 ~~ pis2
'
# Fit the CFA model
cfa_fit <- cfa(measurement_model, data = sim_data, std.lv = TRUE) # std.lv=TRUE standardizes latent variables
# Mplus Equivalent (example for CFA):
# TITLE: CFA Model for Student Engagement;
# DATA: FILE IS "your_data.dat"; ! or your_data.csv
# VARIABLE:
# NAMES ARE pis1 pis2 pis3 se1 se2 se3 am1 am2 am3 seng1 seng2 seng3;
# USEVARIABLES ARE pis1 pis2 pis3 se1 se2 se3 am1 am2 am3 seng1 seng2 seng3;
# MODEL:
# PIS BY pis1 pis2 pis3;
# SE BY se1 se2 se3;
# AM BY am1 am2 am3;
# SEng BY seng1 seng2 seng3;
# OUTPUT: STANDARDIZED MODINDICES;
2. Assess Model Fit (CFI, TLI, RMSEA, SRMR)
After fitting the CFA, you need to evaluate how well it fits your data. Good fit indicates that your observed items indeed measure the intended latent constructs as theorized.
lavaan Output and Interpretation:
# Get summary with fit measures
summary(cfa_fit, fit.measures = TRUE, standardized = TRUE, rsquare = TRUE)
# You can also extract specific fit measures
fit_indices <- fitMeasures(cfa_fit)
print(fit_indices[c("cfi", "tli", "rmsea", "srmr")])
Common Fit Indices and Interpretation Thresholds:
- Chi-square ($\chi^2$): Should be non-significant (p > 0.05) ideally, but with large samples (like n=400), it's almost always significant. Use other indices.
- Degrees of Freedom (df): Reported with $\chi^2$.
- Comparative Fit Index (CFI):
- Good Fit: $\ge 0.95$
- Acceptable Fit: $\ge 0.90$
- Tucker-Lewis Index (TLI) / Non-Normed Fit Index (NNFI):
- Good Fit: $\ge 0.95$
- Acceptable Fit: $\ge 0.90$
- Root Mean Square Error of Approximation (RMSEA):
- Good Fit: $\le 0.06$
- Acceptable Fit: $\le 0.08$
- Include its 90% Confidence Interval (CI; rmsea.ci.lower and rmsea.ci.upper in lavaan) and p-value for close fit (PCLOSE; rmsea.pvalue in lavaan), which should be > 0.05 for good fit.
- Standardized Root Mean Square Residual (SRMR):
- Good Fit: $\le 0.08$
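To make the indices above less of a black box, here is a minimal sketch that computes CFI, TLI, and RMSEA from the model and baseline chi-square values using the common textbook formulas. The function name and the input numbers are hypothetical, not output from any real data set. Values reported by lavaan may differ slightly, for example because software can use N rather than N - 1 in RMSEA or apply scaling corrections with robust estimators.

```r
# Hypothetical helper: textbook formulas for three common fit indices.
# chisq_m, df_m: tested model; chisq_b, df_b: baseline (independence) model;
# n: sample size.
fit_from_chisq <- function(chisq_m, df_m, chisq_b, df_b, n) {
  d_m <- max(chisq_m - df_m, 0)      # misfit of the tested model beyond its df
  d_b <- max(chisq_b - df_b, d_m)    # misfit of the baseline model
  cfi <- if (d_b > 0) 1 - d_m / d_b else 1
  tli <- ((chisq_b / df_b) - (chisq_m / df_m)) / ((chisq_b / df_b) - 1)
  rmsea <- sqrt(d_m / (df_m * (n - 1)))
  c(cfi = cfi, tli = tli, rmsea = rmsea)
}

# Made-up illustration values (12 indicators give a baseline df of 66)
example_fit <- fit_from_chisq(chisq_m = 72, df_m = 48,
                              chisq_b = 2500, df_b = 66, n = 400)
print(round(example_fit, 3))
```

Note how CFI and RMSEA depend on the chi-square only through its excess over the degrees of freedom, which is why they are less sensitive to sample size than the raw chi-square test.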
Steps to Take:
- Examine Loadings: All standardized factor loadings should be statistically significant (p < 0.05) and generally above 0.5 (ideally 0.7 or higher).
- Inspect Residuals/Modification Indices: If fit is poor, examine standardized residuals (large values suggest misfitting items) and modification indices (modindices(cfa_fit, sort. = TRUE, minimum.value = 10)) to identify potential issues (e.g., cross-loadings, correlated errors). Caution: Make changes based on theoretical justification, not just statistical improvement.
3. Test Structural Model with Paths
Once your measurement model is adequately fitting, you can combine it with your hypothesized structural paths between the latent variables.
lavaan Syntax:
- DV ~ IV1 + IV2 defines regression paths where DV is dependent and IV is independent.
- ~~ defines covariances (e.g., between exogenous latent variables).
# Define the full structural model
structural_model <- '
# Measurement Model (from CFA)
PIS =~ pis1 + pis2 + pis3
SE =~ se1 + se2 + se3
AM =~ am1 + am2 + am3
SEng =~ seng1 + seng2 + seng3
# Structural Model (Paths between latent variables)
# Endogenous variable ~ Exogenous variable(s)
AM ~ PIS + SE # Academic Motivation is predicted by Perceived Instructor Support and Self-Efficacy
SEng ~ AM + PIS + SE # Student Engagement is predicted by Academic Motivation, PIS (direct), and SE (direct)
# Covariances between exogenous latent variables (if any)
PIS ~~ SE
'
# Fit the SEM model
sem_fit <- sem(structural_model, data = sim_data, std.lv = TRUE)
# Mplus Equivalent (example for SEM):
# TITLE: Full SEM Model for Student Engagement;
# DATA: FILE IS "your_data.dat";
# VARIABLE:
# NAMES ARE ...;
# USEVARIABLES ARE ...;
# MODEL:
# PIS BY pis1 pis2 pis3;
# SE BY se1 se2 se3;
# AM BY am1 am2 am3;
# SEng BY seng1 seng2 seng3;
#
# AM ON PIS SE;
# SEng ON AM PIS SE;
#
# PIS WITH SE;
# OUTPUT: STANDARDIZED CINTERVAL;
4. Evaluate Direct and Indirect Effects
For models involving mediation, you'll want to assess the direct effect of an independent variable on an outcome, the indirect effect (via a mediator), and the total effect.
lavaan Syntax for Direct, Indirect, and Total Effects:
- You can label paths and then define new parameters based on these labels.
- := is used to define new parameters (e.g., indirect effects).
# Define the full structural model WITH LABELS for paths
structural_model_with_effects <- '
# Measurement Model
PIS =~ pis1 + pis2 + pis3
SE =~ se1 + se2 + se3
AM =~ am1 + am2 + am3
SEng =~ seng1 + seng2 + seng3
# Structural Model with Labeled Paths
AM ~ c1*PIS + c2*SE
SEng ~ d1*AM + d2*PIS + d3*SE
# Covariances between exogenous latent variables
PIS ~~ SE
# Define Indirect Effects
PIS_to_SEng_indirect := c1 * d1
SE_to_SEng_indirect := c2 * d1
# Define Total Effects (Direct + Indirect)
PIS_to_SEng_total := d2 + (c1 * d1)
SE_to_SEng_total := d3 + (c2 * d1)
'
# Fit the SEM model with defined effects
sem_effects_fit <- sem(structural_model_with_effects, data = sim_data, std.lv = TRUE)
# Get summary with all estimates, including user-defined effects
# Use standardized=TRUE for standardized effects
summary(sem_effects_fit, fit.measures = TRUE, standardized = TRUE, rsquare = TRUE)
# For confidence intervals of indirect effects, bootstrapping is highly recommended.
# This can take some time (e.g., 1000-5000 bootstrap draws)
# sem_boot_fit <- sem(structural_model_with_effects, data = sim_data, std.lv = TRUE, se = "bootstrap", bootstrap = 1000)
# parameterEstimates(sem_boot_fit, boot.ci.type = "perc", standardized = TRUE) # percentile bootstrap CIs
# Mplus Equivalent (Indirect Effects):
# Add a MODEL INDIRECT command. Mplus then reports the total, total indirect,
# specific indirect, and direct effects for each requested pair of variables.
# MODEL INDIRECT:
# SEng IND PIS;
# SEng IND SE;
# OUTPUT: STANDARDIZED CINTERVAL;
Interpretation:
- Examine the PIS_to_SEng_indirect and SE_to_SEng_indirect estimates (both standardized and unstandardized) and their p-values/confidence intervals. If the CI does not include zero, the indirect effect is significant.
- Compare the direct effects (d2 and d3) to the indirect effects to understand the mechanisms.
- The total effect represents the full influence, combining both direct and indirect pathways.
5. Report Results with Path Diagram
Reporting SEM results requires a comprehensive overview of your model, its fit, and the significance of its paths.
Written Report Structure:
- Introduction: Briefly state the purpose of the SEM, the theoretical model being tested, and your hypotheses.
- Methods:
- Participants: Describe your 400 participants.
- Measures: Detail the observed indicators for each latent variable, including reliability (e.g., Cronbach's alpha, or composite reliability computed from the current CFA loadings).
- Data Analysis: State that SEM was used, specifying lavaan in R (or Mplus). Explain the two-step approach (CFA then structural model). Mention handling of missing data if applicable.
- Results:
- Measurement Model (CFA) Fit:
- Report the global fit indices ($\chi^2$, df, p-value, CFI, TLI, RMSEA [with 90% CI and PCLOSE], SRMR).
- State whether the model demonstrated good/acceptable fit based on your chosen thresholds.
- Discuss the factor loadings: "All observed indicators loaded significantly onto their hypothesized latent factors, with standardized loadings ranging from X to Y."
- If any modifications were made (e.g., correlated errors), justify them theoretically.
- Structural Model Fit:
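- Report the fit indices for the full structural model. They should ideally be similar to the CFA fit (or slightly worse, but still acceptable). Note that when every pair of latent variables is connected by a path or covariance, as in the example model here, the structural model is statistically equivalent to the CFA and its fit will be identical.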
- Structural Path Coefficients:
- Present the standardized path coefficients ($\beta$), their standard errors (SE), z-values, and p-values for each hypothesized path.
- Interpret the direction and significance of each path:
- "Perceived Instructor Support significantly predicted Academic Motivation ($\beta$ = X, p < .001)."
- "Academic Motivation significantly predicted Student Engagement ($\beta$ = Y, p < .01)."
- "The direct effect of Self-Efficacy on Student Engagement was not significant ($\beta$ = Z, p = .XXX)."
- Report $R^2$ values for endogenous latent variables (e.g., "Academic Motivation, Perceived Instructor Support, and Self-Efficacy together explained X% of the variance in Student Engagement.").
- Direct, Indirect, and Total Effects:
- Clearly report the standardized estimates, SEs, p-values (or 95% CIs) for each calculated indirect and total effect.
- "The indirect effect of Perceived Instructor Support on Student Engagement via Academic Motivation was significant ($\beta$ = A, p < .001, 95% CI [lower, upper]), indicating..."
- Discussion:
- Summarize key findings in relation to your hypotheses.
- Discuss theoretical and practical implications.
- Acknowledge limitations (e.g., cross-sectional data limits causal claims, sample specificities).
- Suggest future research directions.
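For the reliability figures mentioned under Measures, composite reliability (CR) and average variance extracted (AVE) can be computed from the standardized loadings of each factor. The sketch below is one common way to do this; the function name and the loadings are hypothetical, and it assumes no cross-loadings and no correlated residuals, so that each standardized residual variance equals 1 minus the squared loading.

```r
# Hypothetical helper: composite reliability and AVE for one factor,
# computed from its standardized loadings.
reliability_from_loadings <- function(std_loadings) {
  resid_var <- 1 - std_loadings^2            # standardized residual variances
  cr  <- sum(std_loadings)^2 / (sum(std_loadings)^2 + sum(resid_var))
  ave <- mean(std_loadings^2)                # average variance extracted
  c(CR = cr, AVE = ave)
}

# Made-up standardized loadings for a three-indicator factor
example_rel <- reliability_from_loadings(c(0.80, 0.75, 0.70))
print(round(example_rel, 3))
```

In practice the loadings would be taken from the std.all column of the fitted CFA rather than typed in by hand.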
Path Diagram:
A visual representation is essential. You cannot generate a true graphical path diagram directly in a text output, but here's what it should contain and tools to create it:
- Components:
- Latent Variables: Represented by ovals/circles.
- Observed Variables: Represented by rectangles/squares.
- Directed Arrows (Paths): Single-headed arrows indicating regression relationships from one variable to another.
- Curved Arrows (Covariances): Double-headed arrows indicating correlations/covariances between variables (usually between exogenous variables or error terms).
- Error Terms: Represented by small circles/ovals pointing to endogenous latent variables and observed variables.
- Labels:
- Clearly label all latent and observed variables.
- Include standardized path coefficients ($\beta$) on all structural paths.
- Indicate significance levels (e.g., *, **, *** for p < .05, p < .01, p < .001).
- Include $R^2$ values for endogenous latent variables.
- Tools for creating diagrams:
- semPlot package in R: library(semPlot); semPaths(sem_effects_fit, "std", layout = "spring", edge.label.cex = 0.7, nCharNodes = 0, sizeMan = 8, sizeLat = 10, curveAll = TRUE, fade = FALSE)
- Mplus Diagrammer: If using Mplus, it has a built-in diagramming tool.
- Drawing Software: PowerPoint, Google Slides, draw.io, Lucidchart, Inkscape, or specialized academic drawing tools.
6. Use lavaan in R or Mplus
You've seen examples for both.
lavaan (R):
- Pros: Open-source, highly flexible, integrates well with the R ecosystem for data manipulation and visualization, excellent documentation and community support.
- Cons: Requires R coding, output can sometimes be verbose (but summary() options help).
Mplus:
- Pros: Industry standard, very powerful for complex models (e.g., multilevel, categorical, latent class), robust handling of non-normal data and missing data (MLR, FIML), clear and concise output.
- Cons: Commercial software (expensive license), syntax-driven interface (can be a steep learning curve for new users).
Recommendation: Given you're asking for help and likely starting out, lavaan is an excellent choice due to its accessibility and power. It can handle most common SEM tasks very well. If your model becomes highly complex (e.g., very high number of categorical indicators, or complex multilevel structures), then Mplus might offer more specialized tools. For N=400 and a typical engagement model, lavaan is perfectly capable.
General Considerations and Best Practices:
- Sample Size: N=400 is generally adequate for a model of this size (12 indicators, 4 latent variables), though the sample size needed depends on model complexity, loading strength, and the size of the effects being tested.
- Data Cleaning: Ensure your data is clean (no impossible values), handle outliers, and check for missing data. lavaan defaults to listwise deletion. For missing data, consider Full Information Maximum Likelihood (FIML), which lavaan can do with missing = "fiml", or multiple imputation.
- Assumptions: Default maximum likelihood estimation assumes multivariate normality. For non-normal data you can use a robust estimator such as estimator = "MLR" in lavaan, which provides robust standard errors and a scaled chi-square test statistic. SEM also assumes linear relationships.
- Identification: Ensure your model is identified (i.e., there are enough known pieces of information to estimate all parameters). Standard CFA/SEM models are usually identified, but complex or under-specified models can be problematic. lavaan will often warn you if there are identification issues.
- Theoretical Justification: Every path and model modification should be theoretically justified. Avoid making purely data-driven changes that lack theoretical backing.
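The identification point above can be checked with a simple parameter count: the number of unique variances and covariances among the observed variables must be at least the number of free parameters. The sketch below applies that count to the example model; the function name is hypothetical, and it assumes no mean structure. Non-negative degrees of freedom are a necessary condition for identification, not a sufficient one.

```r
# Hypothetical helper: degrees of freedom from a simple parameter count.
# n_indicators: number of observed variables; n_free: free parameters.
sem_df <- function(n_indicators, n_free) {
  n_moments <- n_indicators * (n_indicators + 1) / 2  # unique variances/covariances
  n_moments - n_free
}

# Four-factor CFA from the example with latent variances fixed to 1:
# 12 loadings + 12 residual variances + 6 factor covariances
cfa_free <- 12 + 12 + 6
cfa_df <- sem_df(12, cfa_free)

# Structural model from the example: the 6 factor covariances are replaced by
# 5 regression paths + 1 covariance (PIS ~~ SE), so the count is unchanged
sem_free <- 12 + 12 + 5 + 1
structural_df <- sem_df(12, sem_free)

print(c(cfa = cfa_df, structural = structural_df))  # both 48
```

The equal degrees of freedom reflect that the example structural model places no restrictions on the relations among the four latent variables beyond those of the CFA.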
Good luck with your analysis! This structured approach will provide a robust test of your theoretical model.
Analysis
This comparison demonstrates the different approaches each AI model takes when responding to the same prompt. Here are the key differences observed:
Response Characteristics
ChatGPT: Provides a detailed response with 13 sentences.
Gemini: Provides a direct response with 221 sentences.
Key Takeaways
- Each model brings unique strengths to this type of query
- Response styles vary significantly between models
- Consider your specific use case when choosing between these models
This comparison was generated using the SNEOS AI Comparison Tool. Published: October 15, 2025 | Models: ChatGPT, Gemini