5 Modeling
Chapter 5 of the Dynamic Learning Maps® (DLM®) Alternate Assessment System 2021–2022 Technical Manual—Instructionally Embedded Model (Dynamic Learning Maps Consortium, 2022) provides a complete description of the psychometric model used to calibrate and score DLM assessments, including the psychometric background, the structure of the assessment system, the suitability of the system for diagnostic modeling, and a detailed summary of the calibration and scoring procedures. This chapter briefly summarizes the psychometric model and describes, at a high level, the procedure for updating the operational calibration based on evidence of parameter drift.
5.1 Psychometric Background
Learning maps, which are networks of sequenced learning targets, are at the core of the DLM assessments in English language arts (ELA) and mathematics. Because of the underlying map structure and the goal of providing more fine-grained information beyond a single raw or scale score value, student results are reported as a profile of skill mastery. This profile is created using diagnostic classification modeling (e.g., Bradshaw, 2016), which draws on research in cognition and learning theory to provide information about student mastery of multiple skills measured by the assessment. Results are reported for each Essential Element (EE) at the five levels of complexity (linkage levels) for which assessments are available: Initial Precursor, Distal Precursor, Proximal Precursor, Target, and Successor.
As described in the 2023–2024 Technical Manual Update—Instructionally Embedded Model (Dynamic Learning Maps Consortium, 2024), the calibrated models used data from the 2020–2021, 2021–2022, and 2022–2023 administrations. These three years were chosen because their data are most consistent with the current administration model and minimize confounds due to the COVID-19 pandemic. We retained data from prior administrations in cases where the sample size for an EE and linkage level was less than 250. The threshold of 250 was chosen based on a review of previous operational calibrations using all available data, which indicated that a sample size of 250 is sufficient to obtain adequate psychometric properties. Retaining data from prior administrations was unnecessary for most linkage levels: the combined sample size for the three administrations was at least 250 for 641 linkage levels (87% of all linkage levels) in ELA and 463 linkage levels (87%) in mathematics. In cases where the combined sample size from the three administrations was less than 250, we prioritized data from more recent administrations when retaining prior data.
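As an illustration of this retention rule, the sketch below shows how data for a single EE and linkage level could be assembled. The function name and data structures are hypothetical and are not DLM production code; they simply restate the rule described above.

```python
# Hypothetical sketch of the sample-retention rule described above; the
# function name and data structures are assumed for illustration only.

def assemble_calibration_sample(samples_by_year, calibration_years, min_n=250):
    """Combine student records for one EE and linkage level.

    samples_by_year: dict mapping an administration year (e.g., "2022-2023")
        to the list of student response records for this linkage level.
    calibration_years: the administrations in the calibration window.
    min_n: the minimum sample size considered sufficient for calibration.
    """
    # Start with all data from the calibration window.
    retained = []
    for year in calibration_years:
        retained.extend(samples_by_year.get(year, []))

    # If the combined sample is still below the threshold, add prior
    # administrations one at a time, prioritizing the most recent years.
    prior_years = sorted(
        (year for year in samples_by_year if year not in calibration_years),
        reverse=True,
    )
    for year in prior_years:
        if len(retained) >= min_n:
            break
        retained.extend(samples_by_year[year])

    return retained
```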
Each linkage level was calibrated separately for each EE using separate log-linear cognitive diagnosis models (Henson et al., 2009). Each linkage level within an EE was estimated separately because of the administration design, in which it is uncommon for students to take testlets at multiple linkage levels for an EE. Also, because items are developed to meet a precise cognitive specification (see Chapter 3 of the 2021–2022 Technical Manual—Instructionally Embedded Model; Dynamic Learning Maps Consortium, 2022), the item parameters defining the probability of masters and nonmasters providing a correct response to items measuring a linkage level were assumed to be equal. That is, all items were assumed to be fungible, or exchangeable, within a linkage level. In total, three parameters per linkage level were specified in the DLM scoring model: a fungible intercept, a fungible main effect, and the proportion of masters.
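To make the fungible parameterization concrete, the item response function for a single-attribute log-linear cognitive diagnosis model can be written as follows. The notation here is illustrative and may differ from the notation used in the full model description.

$$
P(X_i = 1 \mid \alpha) = \frac{\exp(\lambda_0 + \lambda_1 \alpha)}{1 + \exp(\lambda_0 + \lambda_1 \alpha)}, \qquad \alpha \in \{0, 1\},
$$

where $\alpha$ is the student's mastery status for the linkage level, $\lambda_0$ is the fungible intercept (the log-odds of a correct response for nonmasters), and $\lambda_1$ is the fungible main effect (the increase in the log-odds of a correct response for masters). The third parameter, the proportion of masters, is the base rate of linkage level mastery in the calibration sample.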
5.2 Calibration
Beginning in 2023–2024, a fixed “base” calibration was used for operational scoring. This adjustment was made to support the evaluation of parameter drift and to increase model estimation efficiency. The DLM technical advisory committee supported using a fixed base calibration alongside an annual evaluation of evidence of parameter drift. Thus, an annual procedure is needed to identify when evidence of parameter drift necessitates an update to the base calibration.
5.3 Evaluating Evidence of Parameter Drift
A comparison calibration was estimated with data from the three most recent administrations (2021–2022, 2022–2023, and 2023–2024) to evaluate whether there was evidence of parameter drift relative to the base calibration (i.e., the calibration estimated with data from 2020–2021, 2021–2022, and 2022–2023). Data from previous administrations were retained for linkage levels with sample sizes less than 250, following the same procedures used for the base calibration.
To evaluate evidence of parameter drift, the fungible intercept and fungible main effect were used along with the proportion of masters to calculate the weighted p-value at each iteration of the posterior distribution for each model in the base and comparison calibrations. The distributions of the weighted p-values from the base and comparison calibrations were then compared using the common language effect size statistic (McGraw & Wong, 1992) to determine whether to update the base calibration. The common language effect size quantifies the overlap between two distributions as the probability that a randomly drawn value from one distribution is greater than a randomly drawn value from the other. Values near .50 indicate that the two distributions largely overlap, whereas values approaching 0 or 1 indicate relatively little overlap.
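A minimal sketch of these two calculations is shown below, assuming the weighted p-value at each posterior iteration is the mastery-weighted average of the conditional probabilities of a correct response for masters and nonmasters. The function names and computational details are illustrative, not the operational implementation.

```python
# Illustrative sketch of the weighted p-value and common language effect size
# calculations; not DLM production code.
import numpy as np

def weighted_p_value(intercept, main_effect, prop_masters):
    """Weighted p-value at each posterior iteration for one linkage level.

    Each argument is a 1-D array of posterior draws: the fungible intercept,
    the fungible main effect, and the proportion of masters.
    """
    p_nonmaster = 1 / (1 + np.exp(-intercept))               # P(correct | nonmaster)
    p_master = 1 / (1 + np.exp(-(intercept + main_effect)))  # P(correct | master)
    return prop_masters * p_master + (1 - prop_masters) * p_nonmaster

def common_language_effect_size(base_draws, comparison_draws):
    """Probability that a random draw of the weighted p-value from the base
    calibration exceeds a random draw from the comparison calibration
    (McGraw & Wong, 1992)."""
    differences = base_draws[:, None] - comparison_draws[None, :]
    return float(np.mean(differences > 0))
```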
Two criteria were used to flag models for evidence of parameter drift. First, models with a common language effect size statistic less than .025 or greater than .975 were identified. Second, because very precise posterior distributions can produce extreme common language effect size statistics even when the weighted p-value distributions are highly similar, models were also required to show an absolute difference in the weighted p-value between the base and comparison calibrations greater than .05. Both criteria must be met for a linkage level model to be flagged for evidence of parameter drift.
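Continuing the sketch above, the two flagging criteria could be applied to a single linkage level model as follows. The use of posterior means for the absolute-difference criterion is an assumption made for illustration.

```python
# Illustrative flagging rule; builds on the functions in the previous sketch.
import numpy as np

def flag_parameter_drift(base_draws, comparison_draws,
                         cles_bounds=(0.025, 0.975), min_abs_diff=0.05):
    """Return True if both flagging criteria are met for one linkage level."""
    # Criterion 1: an extreme common language effect size statistic.
    cles = common_language_effect_size(base_draws, comparison_draws)
    extreme_cles = cles < cles_bounds[0] or cles > cles_bounds[1]

    # Criterion 2: a practically meaningful difference in the weighted
    # p-values, guarding against flags driven only by very precise posteriors.
    meaningful_diff = abs(float(np.mean(base_draws)) -
                          float(np.mean(comparison_draws))) > min_abs_diff

    return extreme_cles and meaningful_diff
```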
Two thresholds, an overall threshold and a grade-level threshold, were used to determine whether there was sufficient drift to necessitate a recalibration of the operational model. The overall threshold is met when 64 or more linkage level models (5% of all linkage level models) are flagged for evidence of parameter drift. The grade-level threshold is met when 25% or more of the linkage level models within a grade are flagged for evidence of parameter drift and 50% or more of that grade's tested population was tested on any of the flagged linkage levels. The overall and grade-level thresholds allow for evaluating broad and localized evidence of parameter drift, respectively. Meeting either threshold indicates parameter drift, in which case the comparison calibration becomes the base calibration used for operational scoring. The flagging criteria and thresholds for updating the base calibration were set so that scoring inferences would be supported; that is, evidence of parameter drift that does not rise to the level of these thresholds should not undermine inferences made from the results.
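A sketch of this decision rule is shown below. The data structures, including the grouping of models by grade and the population proportions, are hypothetical inputs chosen for exposition.

```python
# Illustrative decision rule for updating the base calibration; the data
# structures are hypothetical and chosen for exposition.

def recalibration_needed(flags, grade_of, pct_population_on_flagged,
                         overall_min_flags=64, grade_flag_rate=0.25,
                         grade_population_rate=0.50):
    """Evaluate the overall and grade-level thresholds.

    flags: dict mapping linkage level model IDs to True/False drift flags.
    grade_of: dict mapping linkage level model IDs to a grade label.
    pct_population_on_flagged: dict mapping grade labels to the proportion of
        that grade's tested population who took any flagged linkage level.
    """
    # Overall threshold: 64 or more flagged models (about 5% of all models).
    if sum(flags.values()) >= overall_min_flags:
        return True

    # Grade-level threshold: at least 25% of a grade's models flagged and at
    # least 50% of that grade's tested population on a flagged linkage level.
    for grade in set(grade_of.values()):
        grade_models = [m for m, g in grade_of.items() if g == grade]
        flag_rate = sum(flags[m] for m in grade_models) / len(grade_models)
        if (flag_rate >= grade_flag_rate
                and pct_population_on_flagged.get(grade, 0) >= grade_population_rate):
            return True

    return False
```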
The evidence of drift did not meet the thresholds for updating the base calibration. In total, 30 linkage level models (2%) were flagged for evidence of parameter drift, which is below the overall threshold of 64 models. The flags are presented by subject, grade, and linkage level in Table 5.1. The grade-level threshold was also not met. Thus, the base calibration using data from 2020–2021, 2021–2022, and 2022–2023 was retained for 2024–2025 operational scoring.
Table 5.1. Linkage level models flagged for evidence of parameter drift, by subject, grade, and linkage level

| Grade | Initial Precursor n (%) | Distal Precursor n (%) | Proximal Precursor n (%) | Target n (%) | Successor n (%) | Total n (%) |
|---|---|---|---|---|---|---|
| English language arts | | | | | | |
| 3 | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) |
| 4 | 0 (0) | 0 (0) | 1 (6) | 0 (0) | 0 (0) | 1 (1) |
| 5 | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) |
| 6 | 1 (5) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 1 (1) |
| 7 | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) |
| 8 | 2 (10) | 0 (0) | 1 (5) | 2 (10) | 0 (0) | 5 (5) |
| 9–10 | 0 (0) | 1 (5) | 1 (5) | 1 (5) | 0 (0) | 3 (3) |
| 11–12 | 1 (5) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 1 (1) |
| Overall | | | | | | 11 (1) |
| Mathematics | | | | | | |
| 3 | 0 (0) | 1 (9) | 0 (0) | 1 (9) | 1 (9) | 3 (5) |
| 4 | 0 (0) | 1 (6) | 0 (0) | 1 (6) | 0 (0) | 2 (2) |
| 5 | 0 (0) | 0 (0) | 0 (0) | 1 (7) | 1 (7) | 2 (3) |
| 6 | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) |
| 7 | 1 (7) | 1 (7) | 0 (0) | 1 (7) | 0 (0) | 3 (4) |
| 8 | 2 (14) | 0 (0) | 1 (7) | 0 (0) | 0 (0) | 3 (4) |
| 9 | 0 (0) | 1 (12) | 0 (0) | 0 (0) | 1 (12) | 2 (5) |
| 10 | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) |
| 11 | 2 (22) | 1 (11) | 0 (0) | 1 (11) | 0 (0) | 4 (9) |
| Overall | | | | | | 19 (1) |
5.4 Conclusion
In summary, the transition to a fixed base calibration requires an annual evaluation for evidence of parameter drift, and the results of that evaluation inform whether to update the base calibration. Relatively few linkage levels demonstrated evidence of parameter drift, so the base calibration was retained for 2024–2025. The full calibration results can be found in Chapter 5 of the 2023–2024 Technical Manual Update—Instructionally Embedded Model (Dynamic Learning Maps Consortium, 2024).