New Research Reveals Whether AI Can Judge Teaching as Well as People

Findings Show Machine Learning Excels at Analyzing Classroom Talk, Falters on Context

March 9, 2026   |   By Karen Rivedal, Office of Research & Scholarship

New research from Courtney Bell shows automated scores of teaching effectiveness often align closely with human ratings.

A new study co-authored by UW–Madison School of Education faculty member Courtney Bell explores whether artificial intelligence (AI) can help assess teaching quality using real classroom data, and whether it can do so as well as, or even better than, traditional human ratings, which can be time-consuming, expensive, and inconsistent.

The international team of researchers used a type of AI model known as machine learning to analyze data from 92 math lessons on quadratic equations that were video-recorded in German schools. Participants included 46 teachers and 1,132 middle school students; the students also took achievement tests before and after the lessons.

The team’s goal was to see whether machine learning could produce reliable and meaningful scores on 18 widely used indicators of teaching effectiveness, including classroom management, student support, instructional clarity, discourse, and cognitive engagement. The study findings highlight both the promise and limitations of automated approaches to assessing teaching.

“This study shows that machine learning can provide surprisingly reliable insights into certain aspects of teaching, especially how teachers and students talk and think together during lessons,” said Bell, a professor of learning sciences and a researcher whose work is based at the school’s Wisconsin Center for Education Research. “Our findings also make clear that AI still struggles to interpret the more subtle, situational parts of classroom life.”

Machine learning is a branch of AI in which algorithms enable computer systems to learn from data, identify patterns, and make predictions or decisions without being explicitly programmed for each task.

What the Researchers Found

The study showed that the automated scores often aligned closely with human ratings and sometimes outperformed them. Scores from machine learning models using text and audio were more reliable than human scores on 11 of the 18 measures, including classroom discourse, clarity, and eliciting student thinking. These areas rely heavily on language cues, which machine learning analyzes with "fine-grained precision," the study found, whereas human raters may overlook or interpret the same cues inconsistently.

The study, identified as the first to assess many elements of teaching effectiveness in a multimodal way (using text, audio and classroom video) with machine learning, also tested whether experts found the machine learning-generated scores believable.

In many cases, they did: two trained reviewers judged the automated scores as “slightly more plausible” than human ones, on average, the study found. But the reviewers preferred human scores in elements of classroom management, such as judging how well teachers established routines and handled disruptions — suggesting humans are still better at interpreting complex, sometimes ambiguous behaviors for which context and nuance can matter.

“Human professional judgment remains essential,” Bell said. “The promise here is real, but so is the need for caution as we explore how these tools might support educators.”

The researchers also examined whether machine learning scores predicted students’ math achievement on a post-test better than the human-generated scores did. Those results were mixed as well: for some teaching-effectiveness measures, the machine learning scores showed positive associations with achievement; for others, the associations were negative; and for many, there was no association at all.

Limitations and Cautions Remain Important

The authors said the study overall provides valuable insights for the development of automated feedback. But machine learning is not ready to replace human judgment, they said. The models learn from human ratings, which themselves contain inconsistencies.

Automated systems could someday help provide “comprehensive and immediate insights,” the authors said in the paper, but real classroom use is still “a distant goal.” More research around machine learning is needed, the paper noted, especially to protect privacy, reduce bias, and move the technology beyond simply mimicking human scorers.

“As we continue this work, it’s important to recognize both the opportunities for scalable research and the limitations that make AI unsuitable for high-stakes decisions,” said Bell. “I’d encourage readers to see this as an early step toward tools that might one day assist — not replace — educators and researchers.”

The math lessons analyzed by researchers for this study came from data included in an earlier international study in which Bell participated, known as Global Teaching InSights: A Video Study of Teaching. Bell’s co-authors on the machine learning study are seven researchers from three academic institutions in southern Germany: Tim Fütterer, Ruikun Hou, Babette Bühler, Efe Bozkir, Enkelejda Kasneci, Peter Gerjets, and Ulrich Trautwein.

About WCER

The Wisconsin Center for Education Research (WCER) at UW–Madison’s #1-ranked School of Education is one of the oldest and most productive education research centers in the world. It has assisted scholars and practitioners in developing, submitting, conducting and sharing grant-funded education research for over 60 years. Learn more at wcer.wisc.edu.