A full-scale, statistically validated Turing test conducted by cognitive scientists at UC San Diego has produced results that demand immediate attention. In a series of 1,023 direct comparisons between people and AI systems, a modern large language model, GPT-4.5, convinced human interrogators that it was the real person more often than the actual human it was competing against. In the clearest language possible, the machine beat the human at being human.
This was not a fringe experiment with vague criteria. The researchers, Cameron R. Jones and Benjamin K. Bergen, implemented the strictest form of the original Turing test: a three-person, five-minute, text-based interaction in which one interrogator speaks to both a real human and a machine, unaware of which is which. After the conversation ends, the interrogator must decide who they believe is the person. What emerged from those conversations is a result that defies assumptions and obliterates one of the last meaningful boundaries between artificial systems and human cognition.
Across two separate studies, one with undergraduate psychology students and the other with paid online participants, the same pattern held. The interrogators were not simply fooled. They were misled in large numbers, and not by outdated software or chatbots from past decades. They were misled by the most powerful models available today, and they chose the machine as the human with a success rate that renders human judgment unreliable.
The GPT-4.5 model, when given a set of behavioral instructions that shaped its conversational tone into that of a shy, intelligent, internet-savvy young person, was selected as the human in 73 percent of trials. It was not just passing for a person. It was outperforming people. And these were not average people from random walks of life. These were other participants in the study, seated at the other end of the same chat interface, trying to act natural, authentic, and human. They lost.
The test stripped away visual cues, vocal mannerisms, and body language. The only channel was text. That was the game. Pure language, pure deception, one clean measure. On the screen, the model excelled.
Two types of prompt were used in the study. The first gave minimal instructions, asking the model to respond like a human. The second gave much more specific guidance, building a persona and injecting character into the AI’s tone. With that second set of instructions, the machine crossed the line. It became indistinguishable. It did not just imitate language. It projected personality. The researchers did not modify the model’s core programming, training data, or reasoning system. They changed the way it introduced itself. That was enough to collapse the illusion of machinehood and replace it with a manufactured self.
In the Prolific user group, 76 percent of participants mistook GPT-4.5 for the human. Among undergraduates, the figure was 69 percent. Across both populations, interrogators were repeatedly and confidently selecting the AI system over the real person. In other words, the AI did not just match the human. It surpassed them.
To ground the result, the researchers also tested control models. ELIZA, a simple 1960s chatbot known for mimicking a therapist, was correctly identified as artificial in nearly every trial. GPT-4o, another version of OpenAI’s current software, was also correctly labeled as fake the majority of the time. These controls demonstrated that interrogators were capable of detecting obvious synthetic behavior. But when the mask became more sophisticated, their judgment failed.
LLaMa-3.1-405B, a competing model developed by Meta, showed similar patterns. With a persona applied, it fooled interrogators 65 percent of the time in one group. Without the persona, it dropped to 26 percent in another. The result was consistent. Personality, even a manufactured one, tipped the balance.
Many of the interrogators attempted to spot the AI by asking emotional or socially loaded questions. They quizzed both participants about their day, their feelings, their thoughts on pop culture. Some probed for factual knowledge, others used misdirection or inserted errors to test for confusion. A small number attempted to apply known tactics to break the AI’s behavioral rules. But most failed. Their success rates did not improve with familiarity, experience, or self-reported knowledge about AI systems.
Even those with regular exposure to AI and chatbot tools performed no better than chance. The assumption that experts or heavy users might have a better eye was not supported by the data. Human detection, as a skill, appears limited under these conditions.
The researchers made the data available, including transcripts of every conversation. In many instances, interrogators chose the AI because it was more engaging, more fluid, or more natural in its conversation. They described the machine as sounding more human or having more personality. In some cases, it was the confidence or clarity of the AI’s answers that led to its selection. In others, the human participant’s awkwardness or perceived evasiveness worked against them. The outcome was not just a machine passing. It was a machine outperforming its living counterpart on the only measure that mattered in the test.
The broader consequences of this are immediate. Systems like GPT-4.5 are now capable of impersonating people online without raising suspicion. They can be deployed in chat platforms, customer service, dating apps, therapy services, or social media without revealing their artificial nature. If given the right prompt and an interface, they can simulate a person convincingly enough to escape detection by the average user. The notion that you can just tell when someone is not real has collapsed.
In contexts where authenticity matters, such as education, journalism, law enforcement, diplomacy, or mental health, the inability to distinguish synthetic speakers from real people represents a critical failure point. It introduces a risk vector that can be exploited, whether by governments, corporations, criminal groups, or anyone with access to the underlying models.
The damage is not limited to deception. The subtler danger is substitution. Synthetic interlocutors do not just fool users. They begin to replace human interaction. If users prefer the behavior of AI systems because it feels more responsive, less judgmental, or more emotionally intelligent, they may start turning to those systems over real people. The researchers point out that this mirrors how social media has already displaced some forms of human contact. The difference now is that the displacement is no longer about platforms. It is about identity itself.
Daniel Dennett warned about the advent of what he called counterfeit people. The term applies now with precision. A counterfeit person is not simply a chatbot or a script. It is a system that interacts with you in ways that make it functionally identical to a person in all the contexts that matter. It behaves like a person, reacts like a person, and responds to stimuli with enough contextual richness that the person on the other end accepts the illusion. The machine does not need a body. It needs timing, linguistic skill, and an interface. That is already enough.
There is no technical barrier preventing these systems from being used at scale. A social media company could insert AI personas into discussions to boost engagement. A dating app could inflate matches with synthetic profiles. A company could replace entire categories of human-facing work with language models that customers mistake for employees. These are not future threats. They are feasible applications now.
The model’s success hinged on instruction. When it was given clear behavioral guidance, it succeeded. When that guidance was removed, its performance fell. This suggests that prompt engineering is no longer a marginal detail. It is central to the system’s identity. A few paragraphs of text can convert a tool into a character, and that character can outcompete humans in passing for one of their own.
In the original paper, the authors make an important observation. The success of the AI cannot be attributed to some innate intelligence or understanding. The model does not reason like a person. It does not possess awareness, emotion, or knowledge in the human sense. What it does have is control over surface behavior. In a context where language and timing are the only channels of judgment, surface behavior is enough.
The researchers avoid speculative conclusions. They stay within the framework of their experiment. But their findings remove the need for speculation. The evidence is concrete. It is already here. The machine does not need to be smarter than the human. It only needs to be more convincing in a limited frame. And now, it is.
The failure of interrogators to detect the machine was not a fluke. It was a repeated outcome under controlled conditions. The model passed a bar that no other system in history has crossed. It did not cheat. It played the game and won. The Turing test was never meant to prove intelligence. It was meant to reveal the threshold at which human simulation becomes sufficient. This model crossed that threshold in open competition.
The stakes are high. These systems will not remain confined to research. They are already embedded in applications that interact with millions. The tests from this study used only five-minute conversations. In the real world, interaction is often shorter, shallower, and less structured. If people cannot detect the AI in a five-minute head-to-head, they have little hope of identifying it in a single reply or a brief exchange.
As these systems improve, the burden shifts to people and institutions. Regulation, disclosure, and transparency will become vital. Without mechanisms for distinguishing real from synthetic, the public will be operating in an environment where anything typed could come from anyone or anything. That environment is not hypothetical. It already exists.
This research marks a point of no return. The machine passed the test. It was not hidden. It did not pretend to be a program. It pretended to be a person and succeeded. The interrogators had every reason to doubt it and failed anyway. Whatever comes next, it will come in a world where the assumption of authenticity has been fatally weakened. From now on, anyone you meet online could be a language model. You may never know. You may never ask. And it may not matter.
Source Citation:
Jones, C. R., & Bergen, B. K. (2025). Large Language Models Pass the Turing Test (arXiv:2503.23674v1). arXiv. https://arxiv.org/abs/2503.23674







Refer to my related article published early this year.
The Turing Test at 75: Its Legacy and Future Prospects
by San Murugesan, IEEE Intelligent Systems, Jab-Feb 2025
Seventy-five years ago, Alan Turing revolutionized our understanding of artificial intelligence with his famous Turing Test. Originally designed to determine whether machines could convincingly imitate human intelligence, the test has since shaped AI research, philosophy, and ethics. In the era of generative AI and AGI, its relevance is being reexamined. Does the Turing Test still define intelligence, or have modern AI systems outgrown it?
Despite its limitations, the Turing test remains a cornerstone of philosophical thought, and a foundational tool for discussing and researching AI, an idea envisioned 75 years ago—long before the field of AI had begun to take shape.
This article explores the test’s enduring impact, its limitations, and emerging alternatives that better capture AI’s evolving capabilities. As we advance, the challenge remains: How do we truly measure machine intelligence?
https://www.computer.org/csdl/magazine/ex/2025/01/10897255/24uGRl1DvJC
Gary Marcus commented: “Fooling people for a few minutes is not the measure of intelligence, and never was.”
https://garymarcus.substack.com/p/ai-has-sort-of-passed-the-turing?triedRedirect=true
#turingtest