In Part I, we discussed at length the limitations AI has when used to generate texts or even just supposedly factual information about a wide array of topics. Those limitations, which include frequent factual inaccuracy and amplified racial, gender, and other biases, are disturbing, and as such, have important implications for how AI can and should be used by teachers.
Among the first to use AI for grading have been standardized test companies and state testing agencies. These companies train the AI tool to grade essays using thousands of examples of student essays that have already been rated by humans. Each testing year, virtually all companies route a portion of the essay grading to humans to assess the reliability of AI’s grading. Occasionally, the AI grading gets so far off track that the company has to pause it for a period and send the grading back to humans while they retrain the AI grading program. This happened in New Hampshire in 2024. Using AI like this isn’t new. What is new is the availability of AI for the average person, including classroom teachers.
Some scholarly research has emerged around the use of AI in higher education. A study out of Great Britain found that although AI-generated scores on university exams were superficially similar to those of a human grader, the AI tended to produce more middling scores and fewer extreme ones, resulting in human-AI grade differences in 70% of cases. Researchers noted that AI scores were often within 10% of human scoring, but I’d like to point out that 10% is not an insignificant amount. Other academic research noted that computer grading of constructed responses (short answer) tended to give lower scores to English Language Learners than human grading did, which is possible evidence of the biases inherent in AI’s data sets. Ohio State University cites a study that found grading accuracy and reliability decreased when the writing topics were controversial or complex. That same study also found key differences in the rigor of AI grading, with low-performing essays being graded more leniently and high-performing essays more harshly. Ohio State’s conclusion was that although human grading can also be biased in key ways, it is still the gold standard, and AI is not trustworthy enough to be used in grading without considerable human supervision. Keep that in mind as we move into how AI is being used at the classroom level.
A 2023 survey conducted by RAND found that 18% of K–12 teachers were using AI for teaching (i.e., to develop units and lesson plans)* or to tailor activities to students’ specific abilities (i.e., for differentiation and scaffolding). Middle and high school teachers, and those who taught ELA or social studies, were the most likely to be AI users**. But since 2023, the use of AI in education has grown. Teachers are now moving beyond lesson plan creation to evaluating student writing with AI; at CMSi, we have seen it used to grade student writing beginning in 3rd grade. So here’s where we are: AI grading is demonstrably unreliable and becomes less reliable the more complex the task, yet more and more classroom teachers are using it.
This is an emerging and fluid field, so what you’re going to get here is inescapably my own analysis. AI (at least theoretically) does better with simple tasks. But the higher the grade level, the more complex the writing task. In the highest grade levels, like AP, topics may be intentionally controversial, since writing and supporting a cogent argument is a critical skill for students to master. And “complex and controversial” is where AI falters. I’ve read anecdotal accounts that described AI’s essay grading as “generic in a specific way.” AI suggested essentially the same things over and over, regardless of the paper it was grading. I’ve personally seen AI-graded essays that lauded the student’s ability (“You did a great job putting information in paragraphs!”) while ignoring words that weren’t words; illogical, fragmented, and redundant sentences; and a lack of supporting evidence. I’ve seen multiple suggestions, often from AI companies, to use AI just for grammar and usage grading. But although we often separate grammar and usage from the voice and thinking behind an essay, we do that more to have a specific, concrete justification for points than because they are actually separate. The two are closely entwined; grammar and usage significantly affect thinking, structure, and voice. I don’t think AI is capable of assessing linguistic flair or even logic. We know it’s not capable of ascertaining truth. Nuance might be in AI’s vocabulary, but it’s not part of its skill set. I would argue further that using AI to grade papers doesn’t do much to foster a good relationship between teacher and student, or to let the teacher really get to know how a student thinks as expressed in their writing. It’s impossible not to arrive where Ohio State’s researchers did: AI grading isn’t trustworthy enough to be used without a lot of careful human supervision.
And there’s one last thing to consider. Since teachers virtually everywhere have banned students from using AI to write their papers, there is a distinct whiff of hypocrisy around the use of AI to grade those essays. Another aspect is the human connection and understanding between teacher and student, which is incomprehensible to a machine. One student quoted in a New York Times article put it this way:
How can we expect an algorithm to grasp the individuality of a child, the mistakes they made, and furthermore the reason behind them — let alone the inherent subjectivity that is coupled with a subject such as English? Artificial intelligence couldn’t possibly (or at least shouldn’t) comprehend the nuance it takes to teach or even grade a child’s work to the level an educator will.
In an era when districts are making it easier and easier to be placed in charge of a classroom, perhaps this area of expertise is one we would be well advised to hold onto more tightly.
______________________________________________________________
*Speaking from my own experience with AI, it can be a valuable tool for creating lesson plans and performance tasks, but success depends greatly on how the prompt is framed (i.e., what you ask AI to produce), and prompts almost always need to be refined multiple times to achieve a good result. Even a usable result often requires editing in a number of areas. ChatGPT produces rubrics for teachers, but these often have too many categories, with indicators whose distinctions are somewhat meaningless; it also suggests literature that is often not appropriate for the grade level or not suitable for the task. Its best use seems to be as a generator of ideas and frameworks that need considerable editing. The expertise and content knowledge of the teacher are still critical components in creating a usable task or lesson plan.
**This figure may be too low. A K-12 Dive survey of 1000+ teachers found that in 2024, 67% of teachers were using AI, though these usage figures include consulting AI for personal as well as professional reasons; 39% of teachers were using AI specifically to catch student plagiarism.


