World's First Fully Automated AI Scientist Debuts! Westlake University's New System Outperforms Humans by 183.7%

Published: October 9, 2025

In a landmark release, Westlake University's new AI system has beaten the human state of the art by 183.7% on an agent failure attribution benchmark.

Westlake University's NLP Lab has introduced DeepScientist, which it presents as the first AI scientist capable of fully autonomous research. The system carries out goal-driven, iterative scientific discovery and progressively surpasses top human experts without human intervention.

In an AI text detection task, DeepScientist implemented and tested over 1,000 hypotheses in just two weeks, achieving progress equivalent to three years of human effort. On the RAID dataset, its method delivered a 7.9% AUROC improvement over the existing human-designed state of the art (SOTA). It also set new SOTA results in agent failure attribution and LLM reasoning acceleration.
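For readers unfamiliar with the metric, AUROC is the probability that a detector scores a randomly chosen AI-generated text higher than a randomly chosen human-written one. The sketch below computes it by counting correctly ordered pairs. It only illustrates the metric and is not DeepScientist's detection method; the function name, labels, and scores are made up for illustration.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney U).

    labels: 1 for AI-generated text, 0 for human-written text.
    scores: detector outputs, higher means "more likely AI-generated".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up toy data, for illustration only.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
print(auroc(labels, scores))
```

A perfect detector scores 1.0 and random guessing scores about 0.5, so gains near the top of the range are hard to come by.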

Unlike previous AI systems that required clear directives and often produced low-value output, DeepScientist actively identifies limitations in cutting-edge research, proposes novel ideas, writes code, runs experiments, and drafts complete papers. This shift from random discovery to sustained, proactive exploration marks AI's entry into the most creative realms of science.

DeepScientist formalizes discovery as a hierarchical Bayesian optimization problem, aiming to maximize valuable findings within a set budget. It operates a three-tier evaluation loop in which ideas are tested at increasing levels of fidelity and cost. Promising findings advance to the next tier and receive more resources, while the rest are recorded in a "Findings Memory" that informs future exploration, so that resources are allocated efficiently.
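The loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the team's implementation: the tier names, the costs, the upper-confidence-bound (UCB) selection rule, and every function and class name here are hypothetical, and the toy surrogate ignores the memory it is handed, whereas a real system would presumably learn from it.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Finding:
    idea: str
    est_value: float = 0.0            # tier 1: cheap surrogate estimate
    uncertainty: float = 0.0          # tier 1: how unsure the surrogate is
    measured: Optional[float] = None  # tier 2: result of a real experiment
    validated: bool = False           # tier 3: passed the expensive analysis


@dataclass
class FindingsMemory:
    records: List[Finding] = field(default_factory=list)

    def best_measured(self, default):
        done = [f.measured for f in self.records if f.measured is not None]
        return max(done, default=default)


def ucb(finding, kappa=1.0):
    # Upper confidence bound: exploit high estimates, explore uncertain ones.
    return finding.est_value + kappa * finding.uncertainty


def run_discovery(ideas, surrogate, experiment, deep_analysis,
                  baseline, budget, exp_cost=1.0, analysis_cost=3.0):
    memory = FindingsMemory([Finding(idea) for idea in ideas])
    while budget >= exp_cost:
        untested = [f for f in memory.records if f.measured is None]
        if not untested:
            break
        # Tier 1 (cheap): re-score untested ideas; the surrogate may read
        # the memory, including failed experiments.
        for f in untested:
            f.est_value, f.uncertainty = surrogate(f.idea, memory.records)
        # Tier 2 (moderate): run only the idea with the best UCB score.
        pick = max(untested, key=ucb)
        best_before = memory.best_measured(default=baseline)
        pick.measured = experiment(pick.idea)
        budget -= exp_cost
        # Tier 3 (expensive): only results that beat the best so far get
        # the costly follow-up, and only if the budget still allows it.
        if pick.measured > best_before and budget >= analysis_cost:
            pick.validated = deep_analysis(pick.idea)
            budget -= analysis_cost
    return memory


# Toy stand-ins for the real components (all hypothetical).
rng = random.Random(0)
true_value = {f"idea-{i}": rng.random() for i in range(20)}

def toy_surrogate(idea, records):
    return true_value[idea] + rng.gauss(0, 0.2), 0.2  # noisy guess

def toy_experiment(idea):
    return true_value[idea]

def toy_analysis(idea):
    return True

memory = run_discovery(list(true_value), toy_surrogate, toy_experiment,
                       toy_analysis, baseline=0.5, budget=10.0)
tested = [f for f in memory.records if f.measured is not None]
print(len(tested), "tested,", sum(f.validated for f in tested), "validated")
```

The point of the structure is that most ideas are only ever scored cheaply, a few are actually run, and fewer still receive the most expensive treatment, while every outcome stays in the memory.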

On competitive tasks such as AI text detection and agent failure attribution, DeepScientist performed remarkably well. Beyond its success on the RAID dataset, it conceived a novel A2P method for failure attribution, boosting performance on the Who&When benchmark by 183.7% over the human SOTA.
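To put the headline figure in perspective: assuming it is a relative gain (the article does not give the underlying scores), a 183.7% improvement means the new score is about 2.84 times the baseline. The short sketch below shows the arithmetic; the function name and the baseline value are hypothetical and are not results from the benchmark.

```python
def relative_improvement(baseline, new):
    """Percentage gain of `new` over `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline * 100.0


# Hypothetical numbers chosen only to illustrate the ratio.
hypothetical_baseline = 10.0
hypothetical_new = hypothetical_baseline * 2.837
print(round(relative_improvement(hypothetical_baseline, hypothetical_new), 1))
```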

Crucially, DeepScientist excels in low-success-rate environments. Its structured approach balances exploration and exploitation, enabling steady progress where brute-force methods fail. Experiments also revealed a "scaling law" for discovery: the number of high-impact findings per week grew linearly with the GPU resources provided.

DeepScientist signifies a new paradigm of human-AI collaboration, where AI acts as an exploration engine, freeing human scientists to define profound questions and provide ultimate judgment. The team is open-sourcing the core system to accelerate this vision.