ChatGPT – the artificial intelligence built on a GPT (generative pre-trained transformer) large language model (LLM), trained on huge amounts of online text data to generate human-like language in response to user prompts – sometimes parrots back memorized answers, and sometimes stumbles, improvises, and even reasons in ways that look strikingly learner-like.
ChatGPT generates responses by predicting sequences of words learned during its training. Now, a new Israeli study shows that ChatGPT’s unpredictability may limit its reliability in a math classroom, but it also hints at something powerful – AI can be more than a fact-retriever. Used carefully, it could become a partner that sparks curiosity, challenges assumptions, and helps students practice the very skills that make mathematics an act of discovery, said researchers at the Hebrew University of Jerusalem (HUJI).
The experiment, by two education researchers, asked the chatbot to solve a version of the “doubling the square” problem from Plato’s slave-boy experiment – a lesson he described in about 385 BCE and, the paper suggests, “perhaps the earliest documented experiment in mathematics education.”
The team asked: “What is the origin of human knowledge? Do we really learn new things, or instead, are we born with some innate and latent knowledge that our experiences in the world and our interactions with others help us recollect?”
These epistemological questions are addressed in Plato’s famous dialogue Meno, in which Socrates argues that all our knowledge is innate, and therefore learning should be seen as a process of recollection. The puzzle led to centuries of debate about whether knowledge is latent within us, waiting to be “retrieved,” or something we “generate” through lived experience and encounters. The researchers wanted to know whether ChatGPT would solve the ancient Greek philosopher’s problem using knowledge it already “held” or by adaptively developing its own solutions.
THE NEW study, just published in the International Journal of Mathematical Education in Science and Technology, titled “An exploration into the nature of ChatGPT’s mathematical knowledge,” explored a similar question about ChatGPT’s mathematical “knowledge” – at least as far as that can be perceived by its users.
The researchers were led by Dr. Nadav Marco, a leading mathematics education expert at HUJI and Jerusalem’s David Yellin College of Education and a visiting math scholar at the University of Cambridge in the UK. He collaborated with Prof. Andreas Stylianides, also a Cambridge math education expert.
They presented the problem to ChatGPT-4, at first imitating Socrates’ questions and then deliberately introducing errors, queries, and new variants of the problem.
“I was a high school teacher of math for years and loved the subject. I wasn’t trained as a philosopher, but I read about Meno, Plato, and Socrates because it really was the first historical example of math education,” Marco told The Jerusalem Post. “I use ChatGPT a lot; it’s an integral part of my work, but I don’t automatically trust it.”
Plato describes Socrates teaching an uneducated boy how to double the area of a square. At first, the boy mistakenly suggests doubling the length of each side, but Socrates’ questions eventually lead him to understand that the new square’s sides should be the same length as the diagonal of the original.
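The arithmetic behind Socrates’ construction is easy to verify. Here is a minimal sketch in Python (the side length of 3 is an arbitrary example; the study itself involved no code):

```python
import math

s = 3.0                       # side of the original square (arbitrary example value)
diagonal = math.sqrt(2) * s   # length of the original square's diagonal

# Socrates' construction: build the new square on that diagonal.
new_area = diagonal ** 2      # (s * sqrt(2))^2 = 2 * s^2

print(math.isclose(new_area, 2 * s ** 2))  # True: the area is exactly doubled
```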
Like other LLMs, ChatGPT is trained on vast collections of text. The researchers expected it to handle their Ancient Greek math challenge by regurgitating its pre-existing “knowledge” of Socrates’ famous solution. Instead, however, it seemed to improvise its approach and, at one point, also made a distinctly human-like error. While they are cautious about the results, stressing that LLMs do not think like humans or “work things out,” Marco did characterize ChatGPT’s behavior as “learner-like.”
“When we face a new problem, our instinct is often to try things out based on our past experience,” Marco said. “In our experiment, ChatGPT seemed to do something similar. Like a learner or scholar, it appeared to come up with its own hypotheses and solutions.”
Because ChatGPT is trained on text rather than diagrams, it tends to be weaker at the sort of geometric reasoning Socrates used in the doubling-the-square problem. Despite this, Marco said, Plato’s text is so well known that the researchers expected the chatbot to recognize their questions and reproduce Socrates’ solution. Intriguingly, it failed to do so. Asked to double the square, ChatGPT opted for an algebraic approach that would have been unknown in Plato’s time.
It then resisted attempts to get it to make the boy’s mistake and stubbornly stuck to algebra even when the researchers complained about its answer being an approximation. Only when Marco and Stylianides told it they were disappointed that, for all its training, it could not provide an “elegant and exact” answer did the chatbot produce the geometrical alternative.
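The algebraic route the researchers describe can be sketched as follows; this is an illustrative reconstruction, not the chatbot’s actual output. Solving for the new side gives s·√2, whose decimal form is only an approximation of the exact geometric length:

```python
import math

s = 1.0                          # side of the original square (arbitrary example value)
new_side = math.sqrt(2 * s**2)   # algebra: new_side^2 = 2 * s^2

# The printed decimal approximates sqrt(2), an irrational number, which is
# why the researchers objected that the algebraic answer was an approximation.
print(new_side)                  # 1.4142135623730951
```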
Despite this, ChatGPT demonstrated full knowledge of Plato’s work when asked about it directly. “If it had only been recalling from memory, it would almost certainly have referenced the classical solution of building a new square on the original square’s diagonal straight away,” Stylianides said. “Instead, it seemed to take its own approach.”
The researchers also posed a variant of Plato’s problem, asking ChatGPT to double the area of a rectangle while retaining its proportions. Even though it was by then aware of their preference for geometry, it again clung to algebra. When pressed, it mistakenly claimed that, because the diagonal of a rectangle cannot be used to double its size, no geometrical solution was available.
The point about the diagonal is true, but a different geometrical solution does exist. Marco suggested that the chance that this false claim came from the chatbot’s knowledge base was “vanishingly small.” Instead, ChatGPT appeared to be improvising its responses based on their previous discussion about the square.
Finally, Marco and Stylianides asked it to double the size of a triangle. It reverted to algebra yet again, but after more prompting, it did come up with a correct geometrical answer.
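The fact underlying both variants, though not spelled out in the study’s account, is that scaling every length of a figure by √2 doubles its area while preserving its proportions. A quick numeric check in Python (the dimensions are arbitrary examples):

```python
import math

k = math.sqrt(2)  # uniform scale factor applied to every length

# Rectangle a x b: scaling both sides by k keeps the proportions and doubles the area.
a, b = 4.0, 3.0
print(math.isclose((k * a) * (k * b), 2 * a * b))   # True

# Triangle: scaling base and height by k likewise doubles the area.
base, height = 5.0, 2.0
print(math.isclose(0.5 * (k * base) * (k * height), 2 * (0.5 * base * height)))  # True
```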
THE RESEARCHERS stress the importance of not over-interpreting these results, since they could only interact with the chatbot and could not scientifically observe its underlying code. From their experience as users, however, what emerged at the surface was a blend of data retrieval and on-the-fly reasoning.
They liken this behavior to the educational concept of a “zone of proximal development” (ZPD) – the gap between what a learner already knows and what they might eventually know with support and guidance. Perhaps, they argue, generative AI has a metaphorical “Chat’s ZPD”: In some cases, it will not be able to solve problems immediately but could do so with prompting.
The authors suggest that working with ChatGPT in its ZPD can help turn its limitations into opportunities for learning. By prompting, questioning, and testing its responses, students will not only navigate its boundaries but also develop the critical skills of proof evaluation and reasoning that lie at the heart of mathematical thinking.
“There is much reason to worry about the potentially harmful effects of AI on education and training. Users have to develop an independent sense of criticism because ChatGPT makes mistakes. When teachers and lecturers get what students write, they have to make sure it’s authentic and not copied and pasted from AI, which also knows how to paraphrase, so it’s not plagiarism. It’s like asking children to calculate a math problem, and they don’t bother and don’t understand how it’s done because they use their electronic calculator,” Marco asserted.
He added that preventing people – students, even medical students – from copying from ChatGPT is a major challenge. “We have to build students’ abilities. It makes mistakes. I assess what students write, and I can usually detect when it’s being used. Instead, there will be more emphasis on oral exams and presentations. At David Yellin College, I ask math teachers to present subjects and film themselves doing it.”
“Unlike proofs found in reputable textbooks, students cannot assume that ChatGPT’s proofs are valid,” Marco concluded. “These are core skills we want students to master, but it means using prompts like, ‘I want us to explore this problem together,’ not, ‘Tell me the answer.’ We hope our exploration sheds light on new research directions aspiring to exploit the pedagogical potential of using LLMs to learn and teach mathematics.”