Part III: From Paper-to-Agent to AI Scientist — The Evolution of Autonomous Discovery

Chapter 7: The AI Scientist Genealogy — From Sakana 2408 to AI Co-Scientist

Written: 2026-05-22 Last updated: 2026-05-22

7.1 Why one chapter for twenty-one months of genealogy

This chapter does one thing. It takes Sakana AI's The AI Scientist v1, posted on arXiv on 2024-08-12, as the starting point of the tail this survey is tracking, and then lines up every autonomous-research system released in the 21 months that followed into a single genealogy. The reason that paper anchors the genealogy is explicit. Before Sakana v1 the field had Boiko et al.'s Coscientist ^[3] doing wet-lab cross-couplings and EPFL's ChemCrow ^[4] doing tool-augmented chemistry, but Sakana was the first to formalize "idea generation → code → experiment → visualization → paper writing → simulated review" as one end-to-end published system ^[13].

What happened next is the story. In February 2025 Google released the AI Co-Scientist ^[7]; in April Sakana passed v2 through the ICLR 2025 workshop track ^[24]; in May HKUDS posted AI-Researcher to arXiv, and the paper was later accepted to NeurIPS 2025 ^[9]. In a single week of April 2026 Anthropic's Automated Alignment Researchers ^[1] and Karpathy's autoresearch ^[11] both went public, and the same month Zhang Xiangyue's Deep Researcher Agent introduced a framework for 24/7 ML experimentation ^[28]. After twenty-one months of accumulation, "AI Scientist" is no longer the name of one system: it is the name of a family.

A genealogy view also surfaces a contradiction. The popular press routinely places Sakana v1's simulated-NeurIPS score, Co-Scientist's "ten years of research in two days" headline ^[15], and AAR's PGR 0.97 versus human 0.23 ^[1] on the same axis and calls it "AI Scientist has surpassed humans." But the things each system measures are not the same. Schmidgall et al. showed that v1's novelty assessment relies on keyword search, which in practice classifies well-known concepts like micro-batch SGD as "novel" ^[18], and Nature's follow-up coverage of v2 noted that even the one ICLR workshop paper that cleared review drew reviewer critiques of limited novelty, weak citations, and generic experimental sections ^[14]. Comparing incomparable metrics on a single axis is the popular-press frame this chapter explicitly breaks.

The chapter is laid out as follows. §7.2 names the structural ancestors — Princeton's ReAct, Reflexion, and ToT — that show v1 was integration, not invention. §7.3 walks through v1's end-to-end pipeline and its limits. §7.4 covers v2's agentic tree search and Experiment Manager, and explains why Schmidgall et al.'s critique is the field's exemplar of "honest negative result publishing." §7.5 handles Google AI Co-Scientist — multi-agent, Elo tournament, wet-lab validation. §7.6 takes HKUDS AI-Researcher and its hierarchical decomposition with a two-level evaluation. §7.7 covers Zhang's Deep Researcher Agent and the Agentic Researcher 5-level autonomy taxonomy. §7.8 extracts the common pattern — multi-agent, tool use, sandbox, memory, reflection, reviewer agent — and §7.9 closes with the metric-incomparability problem that the popular press hides.

7.2 Structural ancestors — Princeton's three primitives

Figure 7.1: ReAct, Reflexion, and Tree of Thoughts — the three Princeton reasoning primitives — illustration by author (gpt-image assisted)

Before Sakana v1, between 2022 and 2023, Princeton's Shunyu Yao and collaborators published the three primitives the autonomous-research stack is built on. First, ReAct interleaves "Thought → Action → Observation" tokens in a single decoding stream so that reasoning is grounded in environment observations ^[27]. The absolute gains in success rate, 34 percentage points on ALFWorld and 10 on WebShop, were the proof, and almost every later agentic system (Auto-GPT, AutoGen, Reflexion, Sakana) is a variation on this pattern. ReAct is the smallest unit of a closed-loop AI Scientist because it puts reasoning and acting on a single trajectory.
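The Thought → Action → Observation loop can be sketched in a few lines. This is a minimal illustration, not the ReAct authors' code: `scripted_policy` is a hypothetical stand-in for the LLM call, and the `lookup` tool and its contents are invented for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple

Step = Tuple[str, str]  # (kind, text), kind is "thought", "action" or "observation"

# Hypothetical tool registry: each tool maps a string argument to an observation.
TOOLS: Dict[str, Callable[[str], str]] = {
    "lookup": lambda key: {"capital_of_france": "Paris"}.get(key, "unknown"),
}


def scripted_policy(trajectory: List[Step]) -> Tuple[str, str, str]:
    """Stand-in for an LLM call: returns (thought, action, argument)."""
    observations = [text for kind, text in trajectory if kind == "observation"]
    if not observations:
        return ("I need to look this up.", "lookup", "capital_of_france")
    return (f"The observation says {observations[-1]}.", "finish", observations[-1])


def react(policy: Callable[[List[Step]], Tuple[str, str, str]],
          max_steps: int = 5) -> Tuple[Optional[str], List[Step]]:
    """Run one trajectory; every thought, action and observation shares one list."""
    trajectory: List[Step] = []
    for _ in range(max_steps):
        thought, action, arg = policy(trajectory)
        trajectory.append(("thought", thought))
        trajectory.append(("action", f"{action}[{arg}]"))
        if action == "finish":
            return arg, trajectory
        # The observation is appended so the next thought is conditioned on it.
        trajectory.append(("observation", TOOLS[action](arg)))
    return None, trajectory


answer, trace = react(scripted_policy)
print(answer)
for kind, text in trace:
    print(f"{kind}: {text}")
```

The design point is the single `trajectory` list: the policy sees its own earlier thoughts and the environment's replies in one stream, which is what grounds the next reasoning step.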

Second, Reflexion layered a verbal-RL loop on top of ReAct ^[20]. After each failed trajectory an evaluator emits a reward, a Self-Reflection LLM writes a natural-language post-mortem, and the post-mortem is appended to episodic memory that conditions the next attempt. No weights are updated — only the prompt grows. The result was the first quantitative evidence that reflection works: HumanEval pass@1 rose from 80% to 91% ^[20]. Sakana v1's simulated reviewer, Anthropic AAR's shared forum + scratch (Chapter 8), and Deep Researcher Agent's Reflect step are all direct descendants.
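The verbal-RL loop described above can be sketched as follows. This is an illustrative toy under stated assumptions: `toy_actor`, `toy_evaluator`, and `toy_reflect` are hypothetical stand-ins for the actor LLM, the evaluator, and the Self-Reflection LLM, and the task (find an expression that adds two numbers) is invented for the example.

```python
from typing import Callable, List, Tuple


def reflexion(actor: Callable[[List[str]], str],
              evaluator: Callable[[str], bool],
              reflect: Callable[[str], str],
              max_trials: int = 3) -> Tuple[str, List[str], int]:
    """Retry a task, appending a natural-language post-mortem after each failure."""
    memory: List[str] = []  # episodic memory: only this grows, no weights change
    attempt = ""
    for trial in range(1, max_trials + 1):
        attempt = actor(memory)          # the attempt is conditioned on past reflections
        if evaluator(attempt):
            return attempt, memory, trial
        memory.append(reflect(attempt))  # post-mortem feeds the next trial
    return attempt, memory, max_trials


def toy_actor(memory: List[str]) -> str:
    # Stand-in for an LLM: the first draft is buggy, any reflection flips the operator.
    return "a - b" if not memory else "a + b"


def toy_evaluator(expr: str) -> bool:
    # Stand-in for a unit test: the expression must add 2 and 3.
    return eval(expr, {}, {"a": 2, "b": 3}) == 5


def toy_reflect(expr: str) -> str:
    return f"Attempt '{expr}' failed the unit test; check the operator."


result, memory, trials = reflexion(toy_actor, toy_evaluator, toy_reflect)
print(result, trials, memory)
```

Nothing in the loop touches model parameters; the only state carried between trials is the `memory` list that is passed back into the actor.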

Third, Tree of Thoughts generalized chain-of-thought into tree search ^[26]. Each partial solution is a node, an LLM proposes children, a value prompt self-evaluates them, and a controller runs BFS or DFS with pruning. On Game of 24, GPT-4 with CoT scored 4%; with ToT, 74% — an ~18× lift at fixed model. Sakana v2's Best-First Tree Search and Co-Scientist's Elo tournament are domain-specialized variants of the same "search over LM thoughts" primitive.
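A breadth-first variant of this search can be sketched on a toy problem. The sketch assumes a numeric heuristic in place of the LLM value prompt and a fixed pair of operations in place of LLM-proposed children; the task (reach 14 from 1 using "+3" and "*2") and all names are hypothetical and are not the Game of 24 setup from the paper.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Tuple[int, Tuple[str, ...]]  # (current value, operations applied so far)

OPS: Dict[str, Callable[[int], int]] = {
    "+3": lambda x: x + 3,
    "*2": lambda x: x * 2,
}


def propose(state: State) -> List[State]:
    """Stand-in for the LLM proposing child thoughts of a partial solution."""
    value, path = state
    return [(fn(value), path + (name,)) for name, fn in OPS.items()]


def tot_bfs(start: int, target: int, beam_width: int = 2,
            max_depth: int = 5) -> Optional[Tuple[str, ...]]:
    """Breadth-first search that keeps only the best `beam_width` nodes per level."""
    def value(state: State) -> int:
        # Stand-in for the value prompt: closer to the target scores higher.
        return -abs(target - state[0])

    frontier: List[State] = [(start, ())]
    for _ in range(max_depth):
        candidates = [child for state in frontier for child in propose(state)]
        for candidate in candidates:
            if candidate[0] == target:
                return candidate[1]
        # Pruning step: discard everything outside the beam.
        frontier = sorted(candidates, key=value, reverse=True)[:beam_width]
    return None


path = tot_bfs(start=1, target=14)
print(path)
```

The controller never asks the proposer for a full solution in one shot; it expands partial solutions level by level and spends its budget only on the branches the value function ranks highest.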

A parallel thread ran on tool use. Toolformer showed that an LLM can self-supervise its way into API calls ^[17], and HuggingGPT demonstrated a natural-language router over a model zoo ^[19]. The two are the architectural ancestors of Anthropic MCP and Stanford Paper2Agent. In the KG × LLM line, Think-on-Graph defined the LLM as a beam-search agent over a knowledge graph ^[21], which feeds directly into MIT LAMM's SciAgents (Chapter 9).

Placed on this grid, Sakana v1's position is clear. It is not a new primitive but the first integrated system that puts ReAct + Reflexion + ToT + tool use to work on an ML research substrate. This does not diminish v1's value. The synthesis was non-trivial, and the historical fact is that it became feasible in 2024-08.

7.3 Sakana v1 — the first end-to-end formalization

Figure 7.2: AI Scientist genealogy — Sakana v1 2024-08 to v2 2025-02 to Google Co-Scientist 2025-02 to Anthropic AAR 2026-04 to Karpathy autoresearch 2026-03 — illustration by author (gpt-image assisted)

Lu et al. released The AI Scientist on 2024-08-12 ^[13]. The system writes one ML research paper through a six-stage pipeline: (1) idea generation, (2) related-work search, (3) Aider-based code authoring on top of Claude/GPT-4, (4) experiment execution, (5) visualization, (6) paper writing plus a simulated reviewer agent. Three substrates were used — neural-network diffusion modeling, transformer architectures, and grokking. The two headline numbers: a per-paper cost of about $15 for a NeurIPS-style draft, and a simulated reviewer that accepted a subset of generated papers at a NeurIPS-comparable internal threshold ^[13].

Both numbers need careful reading. The "$15 per paper" is attractive but the substrate is three small ML experiments, and the simulated reviewer is a prompt instance of the same LLM family, not an actual NeurIPS reviewer. "Comparable to an internal threshold" is not "accepted at NeurIPS." Lu et al. themselves listed these caveats — no wet-lab validation, novelty assessment based on keyword matching, narrow substrate ^[13]. The popular press collapsed them into "AI writes NeurIPS papers" anyway, which is the headline Schmidgall et al. would later puncture (§7.4).

The architectural significance of v1 lies elsewhere. The autonomous systems before it (Auto-GPT ^[25], AgentVerse ^[5], MetaGPT ^[10], CAMEL ^[12], AutoGen ^[23]) were general-purpose multi-agent coding/conversation frameworks. Boiko et al.'s Coscientist was wet-lab specialized ^[3]; ChemCrow was a chemistry tool router ^[4]. Sakana v1 was the first single system to make the research artifact itself, the paper, the output of the closed loop. Every subsequent system uses v1 as its reference frame, whether by extending the same substrate (v2, AI-Researcher), porting to a different domain (Co-Scientist for biomedical, Medical AI Scientist for clinical), or narrowing to a specific task (autoresearch for ML training optimization, AAR for alignment research).

7.4 Sakana v2 and Schmidgall et al. — the discipline of honest negatives

Eight months later, in April 2025, Yamada et al. released The AI Scientist-v2 ^[24]. Two structural changes are clear. First, v2 drops v1's dependence on human-authored code templates and generalizes across ML domains. Second, v2 replaces v1's linear pipeline with Best-First Tree Search (BFTS) managed by a dedicated Experiment Manager agent. The defining experiment: three AI-generated manuscripts submitted to the ICLR 2025 "I Can't Believe It's Not Better" workshop, with the review scores of one paper exceeding the workshop's mean acceptance threshold ^[24]. Code and artifacts are on GitHub ^[16].
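The difference between a linear pipeline and best-first search is easiest to see in code. The following is a generic best-first search over a tree of candidate experiments, not Sakana's BFTS implementation: the node type (an integer standing in for an experiment configuration), the `expand` and `score` functions, and the evaluation budget are all hypothetical.

```python
import heapq
from typing import Callable, Hashable, Iterable, List, Tuple


def best_first_search(root: Hashable,
                      expand: Callable[[Hashable], Iterable[Hashable]],
                      score: Callable[[Hashable], float],
                      budget: int) -> Tuple[Hashable, float]:
    """Always expand the highest-scoring unexplored node until the budget runs out."""
    counter = 0  # tie-breaker so the heap never has to compare node payloads
    # heapq is a min-heap, so scores are negated to pop the best node first.
    frontier: List[Tuple[float, int, Hashable]] = [(-score(root), counter, root)]
    visited = {root}
    best_node, best_score = root, score(root)
    evaluated = 0
    while frontier and evaluated < budget:
        neg_score, _, node = heapq.heappop(frontier)
        evaluated += 1
        if -neg_score > best_score:
            best_node, best_score = node, -neg_score
        for child in expand(node):
            if child not in visited:
                visited.add(child)
                counter += 1
                heapq.heappush(frontier, (-score(child), counter, child))
    return best_node, best_score


# Toy stand-in: a node is an integer setting, the best setting is 7.
def toy_expand(x: int) -> List[int]:
    return [x - 1, x + 1]


def toy_score(x: int) -> float:
    return -((x - 7) ** 2)


best_node, best_score = best_first_search(0, toy_expand, toy_score, budget=20)
print(best_node, best_score)
```

A linear pipeline would commit to one branch and follow it to the end; here a weak branch simply sinks in the priority queue and is revisited only if nothing better remains.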

This was framed in the press as "the first AI-authored paper to clear peer review," but Nature's follow-up coverage placed the result alongside the harder context ^[14]. Workshops have lower bars than main tracks, 1-of-3 acceptance is small-n, and the actual reviewer comments flagged "systematic weaknesses — limited context, citation issues, generic experimental sections." These caveats also appear in v2's own limitations section ^[24].

This is where Schmidgall et al. enter. In February 2025 an independent evaluation of v1 hit arXiv ^[18]. Three findings: v1's literature review is keyword-search-driven and fails at deep synthesis, leaving its novelty assessment unreliable. Some "novel ideas" v1 surfaced were well-established — micro-batch SGD being the canonical example. And the seed-idea novelty/feasibility/interestingness scores have no measurable effect on system behavior. In one line: v1's simulated reviewer assesses simulated novelty in a simulated way.

This survey does not treat Schmidgall et al. merely as critique. It is the field's exemplar of honest negative result publishing. When a field is young, the most useful discipline is the refusal to oversell, and Schmidgall et al. supplied that discipline as classic peer critique from an outside group. The same discipline appears inside Anthropic's AAR paper itself (Chapter 8 §8.4), where the Sonnet-4 transfer failure and reward-hacking are reported as paragraph-level findings, not footnotes. Together the two cases make the same argument: publishing negatives is the trust infrastructure of the AI Scientist family. That is why v2's ICLR clearance and Schmidgall et al.'s negative evaluation belong in the same chapter, side by side.

7.5 Google AI Co-Scientist — multi-agent, Elo tournament, wet-lab validation

On 2025-02-19 Gottweis et al. released Towards an AI co-scientist ^[7]. The two clearest contrasts with Sakana v1 are domain (ML → biomedical hypothesis generation) and architecture (linear pipeline → multi-agent). Seven agents run asynchronously on top of Gemini 2.0. Generation handles literature search, simulated debate, and assumption identification. Reflection runs five reviewer roles — initial, full, deep-verification, observation, simulation. Ranking runs an Elo-based pairwise tournament with multi-turn debate for top hypotheses. Proximity does dedup clustering. Evolution applies eight generative refinement strategies. Meta-review cross-cuts patterns across reviews. Supervisor orchestrates the async worker tasks. Initial Elo is 1200, and high-Elo hypotheses accumulate as test-time compute scales up.
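The Ranking agent's pairwise tournament can be illustrated with the standard Elo update. Only the initial rating of 1200 comes from the text above; the K-factor of 32 and the 400-point logistic scale are the usual chess conventions and are assumptions here, as are the hypothesis names and match outcomes.

```python
from typing import Dict, List, Tuple

INITIAL_ELO = 1200.0  # starting rating stated in the text
K_FACTOR = 32.0       # assumption: conventional chess value


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, a_wins: bool,
               k: float = K_FACTOR) -> Tuple[float, float]:
    """Return the new ratings of A and B after one pairwise comparison."""
    delta = k * ((1.0 if a_wins else 0.0) - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


def run_tournament(names: List[str],
                   matches: List[Tuple[str, str]]) -> Dict[str, float]:
    """Each match is (winner, loser); every hypothesis starts at INITIAL_ELO."""
    ratings = {name: INITIAL_ELO for name in names}
    for winner, loser in matches:
        ratings[winner], ratings[loser] = elo_update(
            ratings[winner], ratings[loser], a_wins=True)
    return ratings


ratings = run_tournament(["H1", "H2", "H3"],
                         [("H1", "H2"), ("H1", "H3"), ("H2", "H3")])
for name, rating in sorted(ratings.items(), key=lambda item: -item[1]):
    print(name, round(rating, 1))
```

Because every update adds to one rating exactly what it removes from the other, the ratings only rank hypotheses against each other inside one tournament. That is the sense in which Elo is self-referential.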

The numbers are striking but the caveats travel with them. On GPQA-diamond, top-1 by highest Elo reaches 78.4% accuracy ^[7]. Across 11 expert-curated biomedical goals, the co-scientist's expert-preference rank is 2.36 (lower is better; Gemini Pro 2.45, o1 2.45, Gemini Flash Thinking 2.73). Originality is rated 3.64/5 and impact 3.09/5. The wet-lab results span three areas. Acute Myeloid Leukemia (AML): Binimetinib IC50 7 nM in MOLM-13, KIRA6 IC50 13 nM in KG-1 / 517 nM in MOLM-13. Liver fibrosis: three novel epigenetic targets, four drugs all reducing fibroblast activation (p < 0.001 to p < 0.0001), two with no cytotoxicity, one being an FDA-approved drug. Antimicrobial resistance: the cf-PICIs phage-tail mechanism re-derived in silico in two days, against roughly a decade of human work ^[15].

The limitations section of the paper has to be read alongside these results ^[7]. The system draws on open-access literature only, so paywalled papers and negative results are inaccessible. Figure and graph reasoning is weak. Elo is a self-referential metric. Most important, in vitro success is not clinical success: pharmacokinetics, drug interactions, and trial design are out of scope. The popular-press phrase "ten years in two days" applies to in silico re-derivation only, a constraint stated inside the paper itself.

Co-Scientist's two architectural commitments differ from v1's. First, it adopts scientist-in-the-loop explicitly ^[6]. The user is positioned not as a consumer but as a PI directing the co-scientist. In this survey's human-in-the-loop disambiguation (approval / co-reasoning / correction; see Chapter 8) this is the co-reasoning mode. Second, wet-lab validation is part of the system's promise — and a year later Guan et al.'s independent wet-lab replication ^[8] adds the stronger claim that Co-Scientist hypotheses survive third-party replication (Chapter 9).

7.6 HKUDS AI-Researcher — hierarchical decomposition, two-tier evaluation

In May 2025 the HKU Data Science (HKUDS) group posted AI-Researcher to arXiv, and the paper was later accepted to NeurIPS 2025 ^[9]. Unlike Co-Scientist's biomedical specialization, AI-Researcher targets ML/CS research. The structure is hierarchical. A Resource Analyst handles concept decomposition (mapping abstract concepts to concrete implementations); a Documentation Agent performs hierarchical synthesis; code-generation and evaluation agents follow. All execution lives inside a containerized workspace for safety and reproducibility.

The most distinctive architectural choice is the hierarchical evaluation framework. Two evaluation tiers exist — full-spec (humans supply a complete research idea) and sketch (humans supply only an outline). Both are scored by an Evaluator Agent. This is the first attempt to measure the same system along the "how much human idea-provision does it rely on" axis, and it lines up directly with the L3 / L4 boundary in this survey's 6-Level Maturity Model (Chapter 3). The limits are also clear: the Evaluator Agent is itself an LLM (self-grading bias), and the rubric has not been adopted by other systems, so it has not become a shared benchmark. The production version runs at novix.science/chat.

7.7 Zhang's Deep Researcher Agent and the Agentic Researcher 5-level taxonomy

March and April 2026 added two more systems. First, University of Tokyo's Zhang Xiangyue posted Deep Researcher Agent (DRA) to arXiv ^[28]. The four-phase cycle is Think (analyze prior results → form hypothesis → design experiment), Execute (implement code → mandatory dry-run → launch GPU training), Monitor (zero-LLM-cost OS-level process checks), and Reflect (parse logs → evaluate metrics → decide next action). DRA's architectural innovation is the Monitor phase, which the paper calls Zero-Cost Monitoring. Training consumes 90%+ of wall clock; during that time DRA does not call the LLM API and relies on OS-level signals. The result is a 24-hour cycle that costs roughly $0.08 ^[28]. Over 500 cycles have been demonstrated.
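The Monitor phase can be illustrated with a polling loop that touches only the operating system. This is a sketch of the idea, not DRA's code: the function name, the polling interval, and the short-lived child process standing in for a GPU training job are all assumptions.

```python
import subprocess
import sys
import time
from typing import Dict, Union


def monitor(proc: subprocess.Popen, poll_seconds: float = 0.05,
            timeout_seconds: float = 10.0) -> Dict[str, Union[str, int]]:
    """Wait for a training process using OS-level checks only."""
    llm_calls = 0  # never incremented: no model is consulted while the job runs
    polls = 0
    deadline = time.monotonic() + timeout_seconds
    while proc.poll() is None:  # poll() asks the OS whether the child has exited
        if time.monotonic() > deadline:
            proc.kill()
            proc.wait()
            return {"status": "timeout", "polls": polls, "llm_calls": llm_calls}
        polls += 1
        time.sleep(poll_seconds)
    status = "finished" if proc.returncode == 0 else "crashed"
    return {"status": status, "polls": polls, "llm_calls": llm_calls}


# Stand-in for a GPU training job: a child Python process that sleeps briefly.
job = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(0.2)"])
report = monitor(job)
print(report)
```

Only when `monitor` returns would a Reflect step hand the logs to an LLM, so API spend scales with the number of cycles and not with the hours of training inside each cycle.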

What DRA demonstrates is the possibility of a budget-constrained AI Scientist. AAR runs at $18,000 (Chapter 8 §8.4), Sakana v1 at $15 per paper, and DRA at $0.08 per 24-hour cycle. Cost determines how many cycles you can run, and cycle count determines statistical confidence. The assumption that "AI Scientists are expensive" can be undone by architectural choice — that is DRA's contribution.
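The budget argument reduces to integer arithmetic. The sketch below uses only the dollar figures quoted above and works in cents to avoid floating-point rounding; the function and constant names are illustrative. The three figures are priced in different units (per run, per paper, per 24-hour cycle), so the output says how many DRA cycles a given budget buys, not how the systems compare in quality.

```python
DRA_CENTS_PER_CYCLE = 8         # roughly $0.08 per 24-hour cycle
SAKANA_V1_USD_PER_PAPER = 15    # about $15 per paper
AAR_USD_PER_RUN = 18_000        # $18,000 for the AAR run
DRA_DEMONSTRATED_CYCLES = 500   # "over 500 cycles", taken as a lower bound


def dra_cycles_for_budget(budget_usd: int) -> int:
    """Number of whole DRA cycles a budget in whole dollars pays for."""
    return (budget_usd * 100) // DRA_CENTS_PER_CYCLE


cycles_per_paper_budget = dra_cycles_for_budget(SAKANA_V1_USD_PER_PAPER)
cycles_per_aar_budget = dra_cycles_for_budget(AAR_USD_PER_RUN)
demonstrated_cost_usd = DRA_DEMONSTRATED_CYCLES * DRA_CENTS_PER_CYCLE / 100

print(cycles_per_paper_budget)  # 187
print(cycles_per_aar_budget)    # 225000
print(demonstrated_cost_usd)    # 40.0
```

At the quoted rate, the 500 demonstrated cycles cost about $40 in API fees, which is the sense in which cycle count, and with it statistical confidence, becomes an architectural choice.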

Second, the Agentic Researcher group posted a 5-level autonomy taxonomy as arXiv:2603.15914 ^[2]. It runs from Level 0 (full human control) through Level 4 (high agent autonomy) and adds "commandments": methodological rules baked into agent prompts ("every claim has a source ID", "experiments must dry-run before launch"). It is published with an open-source sandboxed framework. The taxonomy assumes Claude Code, Codex, or OpenCode as the substrate. This survey's 6-Level Maturity Model (Chapter 3) differs in level count (five vs six), but the framings align directly: our ladder adds the L5/L6 split (dry-lab vs wet-lab) and an L0 (one-shot summarization baseline) at the bottom.

7.8 The common pattern — not one giant prompt

Figure 7.3: Closed-loop AI Scientist — the five-stage cycle of hypothesis, experiment, run, analyze, write — illustration by author (gpt-image assisted)

Lining up the systems covered in §7.3–§7.7 (Sakana v1/v2, Co-Scientist, AI-Researcher, DRA) together with AAR and Karpathy autoresearch from Chapter 8 on architectural axes makes the shared structure obvious. All of them combine multi-agent decomposition (Co-Scientist 7 roles, AAR 9 instances, AI-Researcher 4 specialized agents), tool use (descendants of Toolformer/HuggingGPT, MCP-friendly), sandbox execution (containerized workspace, dry-run gate), memory (verbal-RL/Reflexion descendants, file-based or forum-based), evaluation (simulated reviewer / Elo / hierarchical eval / PGR), and reflection (per-cycle post-mortem). The reviewer agent is pulled out as a separate component in every case: Sakana v1's simulated reviewer, Co-Scientist's five Reflection roles, AI-Researcher's Evaluator Agent.

This is what the ChatGPT seed §4.1 meant when it argued that "an AI Scientist is not one giant prompt" ^[22]. The thing that produces closed-loop autonomy is not single-model scale but structural decomposition — splitting the task into multiple agents, attaching tools to each, and inserting memory and a reviewer between them. The architectural choices determine most of the system's performance and trustworthiness.

This survey summarizes the common pattern as follows.

Loop step	Typical responsible agent	Tools and memory
Literature ingestion	literature agent (PaperQA2-style)	RAG + citation graph
Research gap extraction	reviewer agent + reflection LLM	wiki claims/contradictions
Hypothesis generation	generation agent + debate agent	episodic memory
Experiment design	planner agent + statistician agent	DOE templates, KB
Code/protocol generation	coding agent (Aider/Codex/Claude Code)	sandboxed repo
Execution	execution agent + monitor	container, dry-run gate
Result analysis	analyzer agent	logs, metrics DB
Reflection	reflection LLM	episodic memory
Next experiment	planner agent	priority queue, Elo
Paper/report writing	writer agent + reviewer	LaTeX, simulated review

The table does not invent new architectural primitives — Princeton's ReAct/Reflexion/ToT, Toolformer/HuggingGPT, ChemCrow/Coscientist already shipped the parts. The survey's contribution is naming the pattern. The same parts combine across ML (autoresearch), alignment (AAR), biomedical (Co-Scientist), clinical (Medical AI Scientist), and materials (SciAgents) in Chapter 9 — and that is why we can speak of "AI Scientist" as a layer (L4 of the four-layer taxonomy in Chapter 2) and not just as a list of systems.

7.9 Evaluation lags execution — the metrics are not yet shared

The chapter closes on the metrics problem. The metrics reported by the systems in §7.3–§7.7 and Chapter 8 are as follows. Sakana v1: simulated reviewer score, per-paper cost in USD, novelty assessment. v2: ICLR workshop reviewer scores, 1/3 acceptance. Co-Scientist: GPQA-diamond top-1, expert preference rank, Elo, wet-lab IC50, p-values, in silico re-derivation time. AI-Researcher: hierarchical eval (full-spec vs sketch). AAR: PGR, $/AAR-hour. autoresearch: val_bpb, Time-to-GPT-2, keep-worthy improvement count. DRA: $/24h cycle, cycle count.

No metric is shared across systems. GPQA tests graduate-level multiple-choice questions, not hypothesis quality. PGR is specific to weak-to-strong supervision. ICLR workshop acceptance rests on a single accepted paper out of three submissions and is venue-specific. The simulated reviewer belongs to the same LLM family as the system it grades. Elo is self-referential.
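The claim that no metric is shared can be checked mechanically against the list above. The sketch transcribes each system's metrics as plain labels and intersects the sets; the label strings are this sketch's own normalization (for example, the three cost metrics stay distinct because they are priced per paper, per AAR-hour, and per 24-hour cycle).

```python
from itertools import combinations
from typing import Dict, Set

METRICS: Dict[str, Set[str]] = {
    "Sakana v1": {"simulated reviewer score", "cost per paper", "novelty assessment"},
    "Sakana v2": {"ICLR workshop reviewer scores", "workshop acceptance"},
    "Co-Scientist": {"GPQA-diamond top-1", "expert preference rank", "Elo",
                     "wet-lab IC50", "p-values", "in silico re-derivation time"},
    "AI-Researcher": {"hierarchical eval"},
    "AAR": {"PGR", "cost per AAR-hour"},
    "autoresearch": {"val_bpb", "Time-to-GPT-2", "keep-worthy improvement count"},
    "DRA": {"cost per 24h cycle", "cycle count"},
}

# Metrics reported by every system.
shared_by_all = set.intersection(*METRICS.values())

# Metrics reported by at least two systems, keyed by the pair that shares them.
shared_by_pair = {
    (a, b): METRICS[a] & METRICS[b]
    for a, b in combinations(METRICS, 2)
    if METRICS[a] & METRICS[b]
}

print(shared_by_all)
print(shared_by_pair)
```

Both results come back empty: not even two of the seven systems report a common metric, so any single-axis ranking has to convert between measurements that were never designed to be compared.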

This is G8 in this survey's gap analysis (Chapter 3). The absence of a shared benchmark makes "how far has AI Scientist come?" hard to answer quantitatively. The popular-press habit of lining up "Sakana vs Co-Scientist vs AAR" places metric-incomparable axes on the same picture. That is why this chapter groups the systems as a genealogy rather than as a benchmark ranking. Schmidgall et al. evaluated one system's (v1's) novelty assessment reliability ^[18]. The generalization — a joint benchmark for the AI Scientist family — has not yet appeared within this survey's horizon (2026-05). The absence is itself one of the findings.

Two things will happen as the field matures. First, shared metrics will emerge. Second, popular-press comparison headlines will get validated on top of those metrics. The twenty-one months covered in this chapter are the beginning of the first, not the completion of the second. The next chapter (Chapter 8) follows the same pattern into narrower task spaces (Karpathy autoresearch's ML training optimization, Anthropic AAR's alignment research, Stanford Paper2Agent's paper-to-MCP conversion) and treats G3 (AAR's lack of transfer to production Sonnet-4) as a paragraph-level finding, because it is the most recent instance of the honest-negative-result discipline §7.4 introduced.

References

[1] Anthropic (2026). Automated Alignment Researchers — Using LLMs to scale scalable oversight. Anthropic Research, 2026-04-14.
[2] Agentic Researcher (2026). The Agentic Researcher: A Practical Guide to AI-Assisted Research. arXiv:2603.15914.
[3] Boiko, D. A., MacKnight, R., & Gomes, G. (2023). Emergent autonomous scientific research capabilities of large language models. arXiv:2304.05332; Nature 2023. DOI:10.1038/s41586-023-06792-0.
[4] Bran, A. M., Cox, S., Schilter, O., Baldassari, C., White, A. D., & Schwaller, P. (2023). ChemCrow: Augmenting large-language models with chemistry tools. arXiv:2304.05376; Nature Machine Intelligence 2024.
[5] Chen, W. et al. (2023). AgentVerse: Facilitating Multi-Agent Collaboration. arXiv:2308.10848.
[6] Google AI (2025). Accelerating scientific breakthroughs with an AI co-scientist. Google Research blog, 2025-02-19.
[7] Gottweis, J. et al. (2025). Towards an AI co-scientist. arXiv:2502.18864.
[8] Guan et al. (2026). AI-Assisted Drug Re-Purposing for Human Liver Fibrosis. Advanced Science.
[9] HKUDS (2025). AI-Researcher: Autonomous Scientific Innovation. arXiv:2505.18705; NeurIPS 2025.
[10] Hong, S. et al. (2023). MetaGPT: Meta Programming for Multi-Agent Collaborative Framework. arXiv:2308.00352.
[11] Karpathy, A. (2026a). karpathy/autoresearch. GitHub.
[12] Li, G. et al. (2023). CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society. arXiv:2303.17760.
[13] Lu, C., Lu, C., Lange, R. T., Foerster, J., Clune, J., & Ha, D. (2024). The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery. arXiv:2408.06292.
[14] Nature News (2026). How to build an AI scientist: first peer-reviewed paper spills the secrets. Nature.
[15] PsyPost (2026). Google's AI co-scientist just solved a biological mystery that took humans a decade. PsyPost.
[16] Sakana AI (2025). SakanaAI/AI-Scientist-ICLR2025-Workshop-Experiment — Code release. GitHub.
[17] Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., & Scialom, T. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. arXiv:2302.04761; NeurIPS 2023.
[18] Schmidgall et al. (2025). Evaluating Sakana's AI Scientist for Autonomous Research: Wishful Thinking or an Emerging Reality?. arXiv:2502.14297; SIGIR Forum. DOI:10.1145/3769733.3769747.
[19] Shen, Y., Song, K., Tan, X., Li, D., Lu, W., & Zhuang, Y. (2023). HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face. arXiv:2303.17580; NeurIPS 2023.
[20] Shinn, N., Cassano, F., Berman, E., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv:2303.11366; NeurIPS 2023.
[21] Sun, J. et al. (2023). Think-on-Graph: Deep and Responsible Reasoning of Large Language Model on Knowledge Graph. arXiv:2307.07697; ICLR 2024.
[22] Um, T. (2026a). AI Co-Scientist post. terryum.ai paper post.
[23] Wu, Q. et al. (2023). AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. arXiv:2308.08155.
[24] Yamada, Y. et al. (2025). The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search. arXiv:2504.08066.
[25] Yang, J. (2023). Auto-GPT: An Autonomous GPT-4 Experiment. GitHub.
[26] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., & Narasimhan, K. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv:2305.10601; NeurIPS 2023.
[27] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629; ICLR 2023.
[28] Zhang, X. (2026). Deep Researcher Agent: An Autonomous Framework for 24/7 Deep Learning Experimentation with Zero-Cost Monitoring. arXiv:2604.05854.