Last week was STAAR test week in Texas and the first time all public school students in the state took the newly revised STAAR. The revision represents a major overhaul in the content, context, and cognitive demand of the test, from an instrument that was almost entirely multiple choice with one generic writing task in ELA to an assessment that is much more rigorous and includes literature-dependent writing of an entirely different order.* The new STAAR also includes more open-ended items — questions where a child must construct a response in their own words rather than select one from a list of possibilities. The Texas Education Agency (TEA), which oversaw the development of the new test, says there are approximately six to seven times as many constructed-response items on the new test. More open-ended student writing means more scorers are needed to evaluate and mark those responses. Sometimes, decisions are like pulling a thread on a sweater. One change leads to another, which leads to another, and before long there’s a pile of yarn on the floor. In the case of the revised STAAR, these decisions culminated in an announcement from the TEA that it is rolling out an “automated scoring engine” for constructed-response items. That’s a somewhat roundabout way of saying that constructed responses will be graded by AI — artificial intelligence.
The TEA rather buried the lede here by focusing first and foremost on the cost savings. Instead of the 6,000 human scorers needed to grade STAAR tests, the state will be able to manage with fewer than 2,000, saving $15 million. That’s no small thing. When questioned, the TEA revealed that it had been using hybrid scoring (a combination of AI scoring and human scoring) since December 2023.** Mentioning that this type of scoring has already been in use is supposed to be reassuring, and an explanation of how it all works is meant to soothe worries about its efficacy.
The AI system — which works like ChatGPT — was trained on 3,000 student responses that were first graded by two sets of human scorers. From those responses, the system is supposed to learn the characteristics of good, average, and poor answers and replicate that scoring on its own. The AI is then supposed to evaluate its own work and assign a confidence level to each score it gives. If it rates a score “low confidence,” the response is flagged for review by a human scorer. It’s also supposed to flag responses it doesn’t understand. And just to be sure, scores are reviewed by TEA test administrators and a random sample of responses goes to a human scorer. The TEA calls this a “robust quality control” process.
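For readers who want a concrete picture of that workflow, here is a minimal sketch in Python. It is my own illustration, not the TEA’s engine: I’m assuming the responses are plain text, the training labels are the human scorers’ rubric points, and “confidence” is simply the probability the model assigns to the score it picks.

```python
# A minimal sketch of the workflow described above -- my illustration, not the TEA's engine.
# Assumed setup: responses are plain text, the labels are human-assigned rubric points,
# and "confidence" is the probability the model assigns to its chosen score.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical stand-ins for the ~3,000 responses scored by two sets of human scorers.
train_responses = [
    "The people in both articles dance to honor their families and traditions ...",
    "Dancing is fun and I like it.",
    "Both articles say the dancers practice, and the evidence shows they dance to celebrate ...",
]
train_scores = [2, 0, 3]  # rubric points assigned by the human scorers

# Train once; after this the model is frozen and only applies what it learned.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_responses, train_scores)

CONFIDENCE_THRESHOLD = 0.80  # hypothetical cutoff for routing to a human

def score_response(text: str):
    """Return (score, confidence, needs_human_review) for one student response."""
    probabilities = model.predict_proba([text])[0]
    score = model.classes_[probabilities.argmax()]
    confidence = probabilities.max()
    # Low-confidence responses are flagged for a human scorer,
    # mirroring the routing step the TEA describes.
    return score, confidence, confidence < CONFIDENCE_THRESHOLD
```

The real engine is surely more sophisticated than this, but the part worth understanding is the routing: assign a score, attach a confidence, and send anything below the threshold to a human.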
But there are hints that it might not be robust enough. At least one district whose students took the STAAR in December reported far more zeros on constructed responses. In the December testing, the state overall saw a sharp increase in zeros. The TEA says this is because new scoring guidelines allow student work to receive a zero if it is incoherent or provides no evidence for assertions.
And this is where we run into some intersectionality. By that, I mean that the AI scoring is being implemented at the same time as the state-wide rollout of the new test. This concurrence is going to make parsing the issues much more difficult. For example:
- Test questions aren’t always written very well and sometimes include content they shouldn’t. A good example is the revision of the TEKS standards for science. The new content isn’t supposed to be tested until 2024-25, but released items from the 2023-2024 STAAR science test included questions that addressed some of that content — content that was not in the old standards. Questions can also be confusing or simply poorly written, and the STAAR has a history of using reading passages that are higher than their purported grade levels.*** How will the state be able to tell whether zeros are coming from bad test questions, faulty reading passages, or bad AI scoring?
- As of 2020, students in Texas public schools are majority Hispanic/Latino. How will AI scoring evaluate English Language Learners and students whose writing includes Spanish words, phrasing, and colloquialisms? Even though the test can be administered entirely in Spanish, this feels like a potential equity issue — and a big one. What safeguards are in place for students of color in Texas?
- The AI scorer learned from two sets of human scorers, but human scoring comes with a host of problems that have been documented pretty extensively. From the people hired to do it (seldom educators or people with expertise in the content areas being tested), to the very short amount of time they routinely spend on each response, to test companies telling scorers to lower their scores because too many kids are doing well, human scoring hasn’t always been a beacon of reliability. How well did the AI learn from a model that was itself potentially faulty?
- The state is allowing parents to request a rescore — for $50. The fee is refunded if the new score is higher than the computer’s score, but that doesn’t change the fact that charging for this service makes it far less likely that economically disadvantaged families can, or will, take this step. What will happen is that families with higher incomes will challenge results because they have the resources to do so. The benefit of the challenge system falls entirely to those with the most money and makes no provision for those without resources — yet another big equity issue brewing here.
We’re left with two issues that are the most worrisome.
First, to say that STAAR tests in Texas are high stakes is an understatement. These results are used to “grade” districts and those grades are used to justify state takeovers and other punitive measures. That should give the state pause; the consequences of poor scores can be extreme and any implementation of those consequences had better be for rock-solid reasons that can stand up to rigorous evaluation. I’m not sure AI scoring is reliable enough for that. And because the results of the STAAR are just that critical, districts are essentially held hostage to whatever mode of assessment the test utilizes, responsible for making sure kids do well on the test even at the expense of other learning.
Second, there’s evidence that districts are already doing something concerning: they’re using AI tools provided by the state to score assessments so they can figure out how to get kids to respond better to open-ended questions. I’ve spent a lot of my professional career looking at test-prep materials, and they are almost uniformly bad — not engaging, not cognitively demanding, and not likely to result in retention. Their worst fault, however, is that test prep makes the floor the ceiling. Tests only evaluate — and can only evaluate — a tiny portion of all the learning students engage in at school. Using AI scorers doesn’t change that. And allowing AI scorers into the classroom means that the responses the AI says are good become the only kinds of responses kids learn how to construct. That is a very risky move away from higher-order thinking, flexibility, and creativity, and toward constraining opportunities to write in personally relevant ways, with unique voices and expression, and to make novel points with the evidence at hand. I am trying to think of ways this kind of prep might benefit kids and their writing and, other than as a means to do better on the STAAR, I am coming up empty.
There is a better way.
When districts have locally prioritized, specific curriculum objectives paired with clear tasks to measure mastery of those objectives, teachers can plan their instruction more effectively and know with more certainty whether students are succeeding. Standards are often broad, sprawling documents that must be navigated by feel; objectives and mastery tasks are focused, specific, and measurable. Once those objectives are in place, districts can build curriculum guides that truly support teachers in their work and offer resources for scaffolding and reteaching, suggestions for grouping, strategies for approaching content, and more. If your district would like help developing curriculum guides that set both students and teachers up for success and promote higher-order thinking and authentic learning, please contact us! We would love to help.
You can read the Texas Tribune’s excellent article about this here.
*Here’s an example of the difference in ELA writing.
Old STAAR, 4th grade: Read the following quotation: “I do not know of anyone who has gotten to the top without hard work” — Margaret Thatcher. Think about the hard work you do. It may be work you do at home, at school, or outside. Write about one type of hard work you do. Tell about your work and explain why it is so hard to do.
New STAAR, 4th grade: Read the article from “Powwow Summer” and the article “Dancing Dragons” [both about 300 words]. Based on the information in both articles, write a response to the following: Explain how the people in both articles dance for similar reasons. Write a well-organized informational essay that uses specific evidence from the articles to support your answer.
**The TEA does NOT want anyone to think it is using AI or ChatGPT or anything similar. The agency maintains that the scoring engine is a program designed by the state and that the computer defaults to this programming every time it grades: it does not “learn” from every response it grades, but goes back to the original programming each time. This gets into the semantics of “artificial intelligence” versus “machine learning.” I checked with a software engineer on this, and essentially, machine learning is just a type of AI. Once a computer learns something, it’s learned it, and when it encounters new responses it uses what it learned. The fact that the computer doesn’t continue to learn doesn’t make it not AI, though it does mean we don’t have to worry about it becoming our robot overlord (wink).
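If that distinction is hard to picture, here’s a toy illustration (mine, not TEA code): the model is fit once on the training set, and scoring new responses afterward only reads the learned parameters. Nothing about the model changes at grading time.

```python
# Toy illustration of "trained once, then frozen" -- not TEA code.
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])  # stand-in features
y_train = np.array([0, 0, 1, 1])                  # stand-in scores

model = LogisticRegression().fit(X_train, y_train)
weights_before = model.coef_.copy()

# Scoring new responses only reads the learned parameters...
model.predict(np.array([[1.5], [2.5]]))

# ...it never updates them; the model "goes back to the original
# programming" every time, just as the TEA describes.
assert np.array_equal(weights_before, model.coef_)
```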
***The STAAR has come under fire more than once for problems with questions and reading passages. Before the first version of the test was even administered in 2012, faculty at Texas A&M wrote an article showing that the reading passages were higher than their purported grade levels — as much as two grade levels higher. In 2016, 71% of all students who took the English I test failed, and another evaluation of reading passages again found that they were above grade level. I couldn’t find any evidence that the quality control process has controlled for these issues in the new test. So I took the obvious step of running a passage from the 2023 4th grade STAAR writing test through a readability analyzer. The analyzer assigned it a Flesch-Kincaid grade level of 6.2 and an “automated” grade level of 5. Make of that what you will, but it doesn’t seem to bode well.
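For anyone who wants to replicate that check, the Flesch-Kincaid grade level is just a formula: 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below uses the textstat package, one of many tools that implement it; the file name is a placeholder for whatever released passage you paste in, and exact numbers vary a little between analyzers because syllable counting is approximate.

```python
# Rough replication of the readability check above (pip install textstat).
# "passage.txt" is a placeholder -- paste in any released STAAR passage.
import textstat

with open("passage.txt") as f:
    passage = f.read()

print("Flesch-Kincaid grade level:", textstat.flesch_kincaid_grade(passage))
print("Automated Readability Index:", textstat.automated_readability_index(passage))
```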