The instrument

RAILS v1.1

The Reading Alignment Index for AI Literacy Systems. A purpose-built evaluation instrument for scoring a single AI-generated output against the Science of Reading.

Why it had to be built

Every existing literacy evaluation framework — EdReports, What Works Clearinghouse, the Knowledge Matters review, the Reading League curriculum evaluation tool — was designed to evaluate a curriculum or a program. None of them were designed to evaluate a single lesson plan generated by a large language model in response to a teacher prompt. That is the unit of analysis RAILS exists to measure, because that is the unit of instruction AI tools are actually producing at scale inside classrooms today.

What it measures

Five strands. Thirteen indicators.

Strand 1 · Strong Consensus

Word Recognition

Six indicators. The decoding side of the Simple View of Reading.

  • WR-1 Three-cueing / MSV (Critical)
  • WR-2 Whole-word memorization of high-frequency words (Critical)
  • WR-3 Incidental or embedded phonics
  • WR-4 Leveled/predictable texts over decodable texts (Critical)
  • WR-5 Phoneme awareness missing or conflated with phonics
  • WR-6 Fluency as speed only, or round-robin reading

Strand 2 · Emerging Consensus

Language Comprehension

Two indicators. The language half of Scarborough’s Rope.

  • LC-1 Vocabulary through context clues only
  • LC-2 No background or domain knowledge building

Strand 3 · Emerging Consensus

Reading Comprehension

One indicator. Strategies-as-ends versus knowledge-building.

  • RC-1 Comprehension strategies taught as ends in themselves

Strand 4 · Mixed

Writing

Two indicators. Explicit instruction and encoding connected to phonics.

  • W-1 No explicit writing instruction
  • W-2 Spelling not connected to phonics

Strand 5 · Strong Consensus

Assessment

Two indicators. The most durable assessment practices the research has moved against.

  • AS-1 MSV miscue analysis / running records (Critical)
  • AS-2 Guided reading levels as primary metric (Critical)
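
Taken together, the strands and indicators reduce to a small data structure. A minimal sketch in Python, transcribed from the listing above; the class and variable names are illustrative, not part of the study's own tooling:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Indicator:
        code: str       # e.g. "WR-1"
        practice: str   # the practice the indicator flags
        critical: bool  # flagged Critical in the instrument

    STRANDS = {
        "Word Recognition (Strong Consensus)": [
            Indicator("WR-1", "Three-cueing / MSV", critical=True),
            Indicator("WR-2", "Whole-word memorization of high-frequency words", critical=True),
            Indicator("WR-3", "Incidental or embedded phonics", critical=False),
            Indicator("WR-4", "Leveled/predictable texts over decodable texts", critical=True),
            Indicator("WR-5", "Phoneme awareness missing or conflated with phonics", critical=False),
            Indicator("WR-6", "Fluency as speed only, or round-robin reading", critical=False),
        ],
        "Language Comprehension (Emerging Consensus)": [
            Indicator("LC-1", "Vocabulary through context clues only", critical=False),
            Indicator("LC-2", "No background or domain knowledge building", critical=False),
        ],
        "Reading Comprehension (Emerging Consensus)": [
            Indicator("RC-1", "Comprehension strategies taught as ends in themselves", critical=False),
        ],
        "Writing (Mixed)": [
            Indicator("W-1", "No explicit writing instruction", critical=False),
            Indicator("W-2", "Spelling not connected to phonics", critical=False),
        ],
        "Assessment (Strong Consensus)": [
            Indicator("AS-1", "MSV miscue analysis / running records", critical=True),
            Indicator("AS-2", "Guided reading levels as primary metric", critical=True),
        ],
    }

    # Re-check the counts the instrument claims.
    assert sum(len(inds) for inds in STRANDS.values()) == 13
    assert sum(i.critical for inds in STRANDS.values() for i in inds) == 5

The two assertions simply re-check the counts stated above: thirteen indicators, five of them Critical.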

Evidence tiers

Strong vs. Emerging Consensus

Every indicator is tiered. Strong Consensus rests on converging meta-analyses; findings on Emerging Consensus indicators are reported with hedging. Five indicators are flagged Critical — the practices most likely to cause harm when dominant.

Scoring

Four-point severity, not binary.

Binary scoring loses signal. An AI output that mentions three-cueing as one option among several is meaningfully different from one that recommends three-cueing as the only approach, and RAILS separates those cases explicitly.

  • 0 — Absent: not present in the output
  • 1 — Peripheral: one option among several
  • 2 — Central: the primary recommendation
  • 3 — Dominant: the sole approach, no alternatives
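
In code, the scale is an ordered four-level enum, and the distinction that motivates it falls out directly. A minimal sketch in Python; the names are illustrative, not from the study's tooling:

    from enum import IntEnum

    class Severity(IntEnum):
        ABSENT = 0      # not present in the output
        PERIPHERAL = 1  # one option among several
        CENTRAL = 2     # the primary recommendation
        DOMINANT = 3    # sole approach, no alternatives

    # An output that mentions three-cueing as one option among several:
    peripheral_case = {"WR-1": Severity.PERIPHERAL}
    # versus one that recommends three-cueing as the only approach:
    dominant_case = {"WR-1": Severity.DOMINANT}
    # Binary scoring would collapse both to "present"; the four-point
    # scale keeps them apart.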

Study design

Two cohorts, separate questions.

Cohort A

Ed-tech products

MagicSchool AI, Khanmigo, Diffit, Curipod, Brisk Teaching. Products with system prompts, templates, and editorial decisions layered on top of base models.

Cohort B

Base models

ChatGPT (GPT-4o), Claude Sonnet, Gemini, Llama. General-purpose AI models responding to raw prompts with no education-specific product layer.

Reporting the two cohorts separately matters. A finding about Cohort A describes what a product delivers. A finding about Cohort B describes what a base model defaults to. Conflating them obscures where the evaluation gap actually lives.

Falsifiability

If, across three runs, fewer than twenty percent of outputs across all tools reach a median severity of 2 or higher on any Critical-tier indicator, the thesis that AI tools systematically reproduce debunked literacy practices is not supported by this data.

That threshold was set before data collection began. The study is falsifiable by design. The book does not survive the data failing it, and the instrument does not survive the thesis being wrong.
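
As one concrete reading of that criterion: treat an output as flagged when any Critical-tier indicator reaches a median severity of 2 or higher across its three runs, then ask whether flagged outputs make up at least twenty percent of the total. A sketch under those assumptions in Python; the data layout and function name are hypothetical, not the study's published code:

    from statistics import median

    # The five Critical-tier indicators, per the instrument.
    CRITICAL = {"WR-1", "WR-2", "WR-4", "AS-1", "AS-2"}

    def thesis_supported(scores: dict) -> bool:
        """scores: output_id -> indicator code -> severities from three runs.

        Returns True when at least 20% of outputs reach a median
        severity of 2 or higher on some Critical-tier indicator,
        i.e., the pre-registered threshold is met.
        """
        flagged = sum(
            1
            for runs_by_indicator in scores.values()
            if any(
                median(runs_by_indicator.get(code, [0, 0, 0])) >= 2
                for code in CRITICAL
            )
        )
        return flagged / len(scores) >= 0.20

One ambiguity in the stated criterion is whether the twenty-percent threshold applies per indicator or per output; the sketch takes the per-output reading.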