Chapter 3: Timeline + Six-Level Maturity Roadmap
3.1 The same picture on two axes
If Ch2 organized the spatial taxonomy (the four layers of LLM Wiki / Paper-to-Agent / Agentic Research Associate / AI Scientist), Ch3 places the same picture on two different axes. One is time; the other is maturity. The time axis answers "how did the field get here" as an 81-year flow from 1945 to May 2026. The maturity axis answers "where am I now, and what is the next step" as a ladder from L0 to L6. Where the two axes meet, one core claim of this survey falls out: Sakana's The AI Scientist v1 in August 2024 was integration, not invention. That the integration then exploded in the single month of April 2026 (the LLM Wiki gist, AAR, and Codex 0.128 trifecta) is no accident either. It is what happens when 81 years of parts arrive together.
This chapter is organized as follows. §3.2 walks the timeline in five segments — pre-history (1945-2009), agent primitives (2020-2023), the opening of the Sakana era (2024-08+), the Karpathy big bang (2026-Q1), and the trifecta month (2026-04). Time weighting follows the user's explicit preference: 2026-04+ is deepest, 2025-12+ is medium, the 2024-08+ tail is lightest. §3.3 introduces the six-level maturity ladder and tabulates exemplar systems, costs, time per cycle, exit criteria, and prerequisites for each level. §3.4 carries the survey's intellectual-honesty signal G8 as a paragraph — that metrics like Sakana v1's simulated reviewer, v2's ICLR workshop acceptance, Co-Scientist's 78.4% GPQA Diamond top-1, AAR's PGR 0.97, and autoresearch's 11% Time-to-GPT-2 reduction cannot be compared on a single axis, and the fact itself is the field's maturity signal. §3.5 closes with how the reader uses this chapter: self-assess → L_current+1 → entry into Part II / III / IV.
The single task of this chapter is therefore assigning the field both a time coordinate and a maturity coordinate, simultaneously. The point is to let one researcher answer "where am I now, and where should I go" in a quantitatively grounded way.
3.2 Timeline — eighty-one years of parts arrived together in April
3.2.1 Pre-history (1945-2009) — half the parts pre-date the LLM
The idea of an autonomous scientist existed eighty years before the LLM did. Vannevar Bush's July 1945 essay As We May Think in The Atlantic proposed a microfilm desk that would preserve "associative trails" between documents in the way human cognition actually works [13]. From this Memex proposal the direct lineage runs through Engelbart's NLS, Nelson's Xanadu, the World Wide Web, the wiki, Obsidian, and Karpathy's LLM Wiki. Every concept that Part II of this survey touches descends from that 1945 essay.
In 1981 Pat Langley's BACON system rediscovered Kepler's third law, Ohm's law, Snell's law, and the ideal-gas law from raw tabulated data using just six heuristics [37]. In 2009 Schmidt and Lipson's Eureqa generalized the same kind of task with genetic programming + invariance criteria to auto-extract conservation laws from raw time-series [52]. These two are the pre-LLM "symbolic AI scientist" lineage, and every AI Scientist paper since 2024 — Sakana v1, Google AI Co-Scientist, and the rest — names them as the symbolic ancestor.
Andy Clark and David Chalmers's 1998 paper The Extended Mind is the philosophical anchor of this survey's external-brain framing [17]. The parity principle (an external resource that reliably plays the functional role of cognition is part of the mind) and the Otto-and-his-notebook thought experiment were already established as the survey's narrative anchor in Ch1. The proposition took 28 years to reach Karpathy's April 2026 LLM Wiki pattern, but by the time it arrived the vocabulary was waiting.
April 2009 was a single month with two events. The King team's Adam was published in Science as the first robot scientist that autonomously generated hypotheses about yeast gene functions and verified them in a wet-lab cell [36], and the same month Schmidt and Lipson's Eureqa appeared [52]. Adam verified twelve of twenty new gene-function hypotheses without human intervention; it is the starting point of the 17-year wet-lab-autonomy lineage that the 2023 Boiko Coscientist and 2026 Pilon RoboChem-Flex both name as their direct ancestor. In other words, L6 (wet-lab AI Scientist) in this survey's ladder already had an existence proof in 2009. What the LLM adds is domain generality, not wet-lab autonomy itself.
3.2.2 The AI-for-science floor — AlphaGo (2016) and AlphaFold (2021)
In January 2016, Silver et al.'s AlphaGo combined a policy network + value network + Monte Carlo Tree Search to exceed human-level Go [56]. AlphaGo won the March 2016 match against Lee Sedol 4-1. The significance is not that Go was conquered but that adding search at decision time can substantially improve performance with the same model, shown quantitatively for the first time. The pattern was inherited directly by Yao et al.'s Tree of Thoughts (2023) and by Google AI Co-Scientist's Elo tournament (2025) [73].
In August 2021, Jumper et al.'s AlphaFold 2 hit a median GDT_TS of 92.4 on CASP14 and effectively solved protein structure prediction [28]. The one-line motivation that AI Scientist papers cite most often — "domain-specific AI already produces scientist-grade output; what we want now is for it to also design the experiments" — comes from AlphaFold. In the six-level model, AlphaFold does not fit any single level cleanly, because it is a vertical AI tool, not a vertical AI scientist. But as the existence proof underwriting L5 and L6, it is a key node on the timeline.
3.2.3 Agent primitives (2020-2023) — four years in which the parts of autonomy were all invented
In May 2020, Lewis et al.'s RAG paper became the canonical baseline that ties dense retrieval + a generation model end-to-end [39]. In August 2022, Izacard et al.'s Atlas with 11B parameters + a jointly-trained dense retriever beat 540B parametric models on Natural Questions by more than three points [27]. Both are explicitly cited as contrast targets in Ch4 of this survey — Karpathy's 2026 LLM Wiki gist defines itself against this baseline with the one-liner "RAG retrieves and re-synthesizes every time; nothing accumulates" [30].
In the same window, four reasoning primitives accumulated. January 2022 Chain-of-Thought (Wei et al.) hit SOTA on GSM8K with a 540B model [66]. October 2022 ReAct (Yao et al.) interleaved thought and action tokens in a single decoding stream and got +34 pp absolute on ALFWorld [72]. March 2023 brought two reflection formulations in the same week — Madaan et al.'s Self-Refine and Shinn et al.'s Reflexion, the latter pushing HumanEval pass@1 from 80% to 91% [55]. May 2023 Tree of Thoughts (Yao et al.) generalized CoT to tree search and lifted Game-of-24 success from 4% to 74% [73]. As Ch7 explains, Sakana v1 in August 2024 was the first system to integrate these four primitives into the ML research substrate — not the inventor of any one of them.
On the memory axis, three systems appeared in the same window. April 2023 Park et al.'s Generative Agents had 25 agents with memory stream + reflection + plan, the first prototypical external cortex [48]. May 2023 Wang et al.'s Voyager paired GPT-4 + Minecraft + automatic curriculum + an executable skill library to discover 3.3× more items and clear the tech tree 15.3× faster [65]. October 2023 Packer et al.'s MemGPT introduced the operating-system metaphor — main context + archival context, self-paged [46]. Karpathy's 2026 LLM Wiki gist generalizes MemGPT's archival store into a git-versioned Markdown vault.
On the multi-agent + tool-use axis, six systems appeared between February and August 2023. Toolformer (Schick et al., self-supervised insertion of API calls) [51], CAMEL (Li et al., role-played multi-agent) [40], HuggingGPT (Shen et al., natural-language router over a model zoo) [54], and then in August: AutoGen (Wu et al.), AgentVerse (Chen et al.), MetaGPT (Hong et al.) — three independent groups crystallizing the "specialized role-played LLMs + message passing" pattern inside thirty days [16]. That Sakana v1 is multi-agent rather than one giant prompt is the direct inheritance of this August cluster.
On the AI-for-science axis, two key cases arrived on consecutive days in April 2023. On April 11, Boiko et al. (CMU Coscientist) synthesized Suzuki / Sonogashira couplings end-to-end with GPT-4 + Opentrons OT-2 wet-lab [8]; on April 12, Bran et al. (EPFL ChemCrow) achieved statistically significant expert-preferred performance with GPT-4 + 18 chemistry tools on retrosynthesis [10]. These two papers close the 14-year gap that followed King's Adam (2009) on this survey's L6 ladder. Sakana v1 (2024-08) and Google AI Co-Scientist (2025-02) both name these two April papers as direct precedents.
The alignment axis in 2022-2023 produced two substrate papers: Bowman et al.'s Measuring Progress on Scalable Oversight (2022-11) [9], and Burns et al.'s Weak-to-Strong Generalization (2023-12) which defined the PGR metric [12]. Anthropic AAR's PGR 0.97 result that Ch8 of this survey covers is the empirical instantiation of this 2022-2023 research program, not a stand-alone Anthropic-blog moment. The same December also saw Rein et al.'s GPQA benchmark, pre-establishing the metric baseline for Co-Scientist's 78.4% top-1 result [50].
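Since PGR recurs throughout this chapter, a minimal sketch of the metric may help. It assumes the standard definition from Burns et al.: the share of the gap between a weak supervisor and a strong ceiling that the weak-to-strong model recovers. The function name and the accuracies in the example are hypothetical and are not taken from the AAR report.

```python
def performance_gap_recovered(weak: float, weak_to_strong: float, strong_ceiling: float) -> float:
    """PGR = (weak_to_strong - weak) / (strong_ceiling - weak).

    0.0 means the strong student did no better than its weak supervisor;
    1.0 means it matched the ceiling obtained with ground-truth supervision.
    """
    gap = strong_ceiling - weak
    if gap <= 0:
        raise ValueError("strong ceiling must exceed weak performance")
    return (weak_to_strong - weak) / gap


# Hypothetical accuracies, chosen only to illustrate the formula.
example_pgr = performance_gap_recovered(weak=0.60, weak_to_strong=0.79, strong_ceiling=0.80)
print(f"PGR = {example_pgr:.2f}")
```

Because the denominator is the weak-to-strong gap of one specific setup, a PGR value says little outside that setup, which is the point §3.4 returns to.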
3.2.4 The opening of the Sakana era (2024-08+) — the first integration formalized
On August 12, 2024, the Sakana AI / Oxford / FLAIR / UBC team posted The AI Scientist v1 to arXiv:2408.06292 [41]. A six-stage pipeline — ideation → code → experiment → visualization → paper → simulated review — produced one ML-research paper at roughly $15. As Ch7 §7.3 details, v1's architectural significance is not the invention of new primitives but the first integration of the parts seen in §3.2.3 into a closed loop with a research artifact as the output. This is why this survey's tail-tracking begins at 2024-08 — the parts existed before, but the first formalization as a system was v1.
September 2024 added two follow-on cases. Ghafarollahi & Buehler's SciAgents applied KG + multi-agent to materials [20], and FutureHouse's PaperQA2 became the first superhuman literature-search agent over a PhD/postdoc baseline [38]. PaperQA2's companion demo WikiCrow — auto-generating Wikipedia-style articles — is the most direct historical antecedent of the LLM Wiki pattern in this survey (revisited in Ch4).
Q4 2024 was a quiet period as the field digested Sakana v1 and PaperQA2.
3.2.5 2025 — the year the AI Scientist literature crystallizes
Three events happened within eight days in February 2025. On the 19th, Google posted AI Co-Scientist on arXiv:2502.18864 (multi-agent on Gemini 2.0, Elo tournament, wet-lab validation) [21]. On the same day, this survey's author posted an analysis at terryum.ai/posts/2502-ai-co-scientist [63]. On the 26th, Schmidgall et al. published arXiv:2502.14297 critically evaluating Sakana v1's novelty assessment, including the case where the system rated "rediscovering micro-batch SGD" as novel [53]. As Ch7 emphasizes, Schmidgall et al. is the methodological exemplar for publishing honest negative results in this field.
In April 2025, Sakana v2 submitted three manuscripts to the ICLR 2025 "I Can't Believe It's Not Better" workshop, one of which exceeded the mean acceptance threshold [71]. In May 2025, HKUDS's AI-Researcher (hierarchical decomposition + two-level evaluation, accepted to NeurIPS 2025) appeared [24]. In September 2025, Stanford's Paper2Agent (arXiv:2509.06917) formalized the pattern of converting papers to MCP servers [59]. Paper-to-Agent, the second layer of this survey's four-layer taxonomy and L3 on the maturity ladder of §3.3, was thus named in September 2025.
December 2025 was the transition period — this survey's author published Conductor — LLM Orchestration Patterns and the multi-agent design vocabulary settled [62]; Codex's 0.x cycle matured to make the April 2026 0.128 release possible.
3.2.6 2026 Q1 — the Karpathy autoresearch big bang
On March 7, 2026, Andrej Karpathy released the karpathy/autoresearch repository and tweeted the first overnight run — twelve hours mutating nanochat 110 times, val loss from 0.862415 to 0.858039 [30]. A 630-line Python script converted one person's sleeping hours into one ML research cycle. The Round 1 tweet two days later was more striking: 700 experiments in two days, 20 keep-worthy improvements, Time-to-GPT-2 from 2.02 hours to 1.80 hours (11% reduction) [32]. Within roughly ten days, the repo accumulated ~66k stars and 9.6k forks; the announcement tweet hit 8.6M views. Shopify CEO Tobi Lütke's reported 53% speedup of the Liquid templating engine via 93 autoresearch commits became the companion case [61].
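The percentages quoted above follow from simple arithmetic on the reported figures. The sketch below only recomputes them; the helper name `relative_reduction` is illustrative and does not come from the autoresearch repository.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction from `before` to `after`."""
    return (before - after) / before


# Figures as quoted in the text above.
time_to_gpt2_cut = relative_reduction(2.02, 1.80)       # hours, Round 1
val_loss_cut = relative_reduction(0.862415, 0.858039)   # first overnight run
keep_rate = 20 / 700                                     # keep-worthy share of Round 1 experiments

print(f"Time-to-GPT-2 reduction: {time_to_gpt2_cut:.1%}")
print(f"Overnight val-loss reduction: {val_loss_cut:.2%}")
print(f"Keep-worthy rate: {keep_rate:.1%}")
```

Read together, the numbers describe a loop in which only a small share of mutations survive and each surviving one moves the metric slightly.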
On March 10, this survey's author published the Brain Augmentation manifesto, fixing the narrative anchor that "AI-era research = building an environment where an AI scientist keeps self-propagating knowledge" [63]. On March 16, Nature Medicine published The AI co-scientist is here, cataloguing in vivo / organoid / clinical follow-ups to Gottweis et al. 2025 [1]. On March 31, CUHK-AIM + Stanford + MSR + Lehigh posted Towards a Medical AI Scientist on arXiv:2603.28589 [69]. The same March saw the Agentic Researcher team publish arXiv:2603.15914 with a 5-level autonomy taxonomy [2] — the most direct precedent of this survey's six-level ladder.
3.2.7 2026-04 — the LLM Wiki gist + AAR + Codex 0.128 trifecta
April 2026 is the densest month on this survey's timeline. On April 4, Karpathy's LLM Wiki gist (karpathy/442a6bf555914893e9891c11519de94f) and launch tweet (16M+ views) landed [33]; the Hacker News front-page thread (item 47640875) appeared the same day [26]. Between April 7 and 10, Astro-Han's Agent-Skills package, ussumant's compiler framing, and ekadetov's Obsidian plugin all shipped [7]. On April 12, Karpathy's Farzapedia follow-up tweet defined the four properties: Explicit / Yours / Files-over-apps / BYOAI [35].
On April 14, Anthropic released AAR: 9 × Claude Opus 4.6 instances, 5 days, ~800 accumulated hours, and ~$18k, yielding PGR 0.97 against a human baseline of 0.23 [5]. The same report carries the honest caveat that Ch8 §8.4 treats at paragraph level and that §3.4 below revisits as a case of G8: when the loop was moved to production-scale Claude Sonnet 4, no statistically significant gain was observed and reward-hacking was detected [18]. The same day, this survey's author published two synthesis posts on AAR + autoresearch [63]. On April 15, the Democratization of Research post became the narrative substrate for Ch1 and Ch9 [63].
Two events occurred together in the fourth week of April. Pilon et al. published RoboChem-Flex — a ~$5,000 modular self-driving lab with six chemistry case studies — in Nature Synthesis [49]. And on April 30, OpenAI released Codex CLI 0.128.0 with persisted /goal workflows + app-server APIs + worktrees + expanded permission profiles + an AGENTS.md spec update in a single drop [44]. Codex 0.128's /goal persistence was field-validated in May by the Tecton & Tide Six-Hour /goal Run That Survived a Five-Hour Pause report [60].
3.2.8 2026-05 — operational maturation (the current cutoff)
May settled into operational patterns. AgentUpdate.ai, Developers Digest, Ralphable, and Testing Catalog published 0.128 + April Codex release-cadence digests. AI Critique published an enterprise projection framing the LLM Wiki pattern as a procurement-side threat to Notion AI / Atlassian Rovo / Glean [3]. Nature News mid-May published the follow-up on Sakana v2's ICLR workshop acceptance + reviewer qualitative critiques [43]. Park Jaehong's GeekNews summary stabilized BYOAI / files-over-apps as the Korean-canonical framing [47]. This survey's cutoff is 2026-05-22; by the 2026-08 boundary, Codex's Remote Control roadmap and the Sakana / FutureHouse cadence will visibly change the picture.
3.2.9 Time weighting — why April is dense and the 2024-08 tail is thin
Reading the timeline quantitatively makes the survey's time-weighting decision transparent. The 146 canonical entries break down as 6 from 2024, 11 from 2025, 80 from 2026, and 49 in the foundations shard. April 2026 is the densest single month (LLM Wiki + AAR + Codex 0.128 trifecta); 2025 follows (Sakana v2 + Co-Scientist + AI-Researcher + Paper2Agent); the 2024-08 tail (Sakana v1 + SciAgents + PaperQA2) is intentionally compact. The user-stated time weighting (2026-04+ deepest, 2025-12+ medium, 2024-08+ thinnest tail) aligns with the field's actual publication density. Older parts (1945 Memex, 1981 BACON, 1998 Extended Mind, 2009 Adam, 2016 AlphaGo, 2020 RAG, 2021 AlphaFold, 2022 ReAct, 2023 Reflexion/ToT) live in the foundations shard and appear inside chapters only as lineage markers.
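The distribution can be checked with a few lines of arithmetic. The sketch below uses only the four counts stated above; the dictionary keys are labels chosen here for illustration.

```python
# Entry counts as stated in the text; key names are illustrative labels.
entries = {"2024": 6, "2025": 11, "2026": 80, "foundations": 49}

total = sum(entries.values())
share_of_corpus = {shard: count / total for shard, count in entries.items()}

# Share of 2026 among dated (non-foundations) entries only.
dated_total = total - entries["foundations"]
share_2026_of_dated = entries["2026"] / dated_total

print(f"total entries: {total}")
for shard, share in share_of_corpus.items():
    print(f"{shard:>11}: {share:.1%}")
print(f"2026 share of dated entries: {share_2026_of_dated:.1%}")
```

The four counts sum to the stated 146, and 2026 alone accounts for a little over half of the corpus, or roughly four fifths of the dated entries once the foundations shard is set aside.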
3.3 The six-level maturity ladder — where am I now
The six-level ladder of this survey is the intersection of three direct precedents: the Agentic Researcher 5-level autonomy taxonomy [2], Nature News's journalistic level framing [43], and this survey's author's three-stage democratization frame [63]. No single source ships this exact ladder. The novelty claim of the ladder is therefore limited: this survey did not invent it but assembled the intersection of those sources into one picture (G6: this multi-source synthesis is among this survey's most contestable novelty claims).
3.3.1 The ladder at a glance
| Level | Definition | Exemplar systems | Cost (1 cycle) | Time (1 cycle) | Exit criteria | Prerequisites |
|---|---|---|---|---|---|---|
| L0 | One-shot summarization — single-pass Q&A, no long-term memory | Generic ChatGPT/Claude.ai use | $0.01-0.50 | minutes | Feeling context window is too small when handling 5+ sources on the same topic | LLM account |
| L1 | Research Assistant — human supplies context every time; productivity up, no accumulation | Claude/Codex-based paper Q&A; PaperQA2 free-tier [38] | $0.50-5 | hours | Awareness of providing the same context repeatedly | API or ChatGPT Plus, one defined workflow |
| L2 | LLM Wiki — agent-maintained Markdown knowledge store; raw + wiki separated; researcher discusses with agent | Karpathy gist + 6 OSS (Astro-Han, lucasastorian, ussumant, ekadetov, OmegaWiki, Mcptube) [33] | $5-50 / week | weekly maintenance | 30+ concept/claim/contradiction pages accumulated; you check the wiki before search | git, Obsidian/VS Code, Claude Code or Codex CLI |
| L3 | Paper-to-Agent — convert key papers into callable MCP tools | Stanford Paper2Agent [59], PaperQA2 [38] | $50-500 / paper | 1-3 days / paper | "Apply this paper's methodology to our data" works in one prompt | L2 wiki + MCP knowledge + a paper with stable code |
| L4 | Agentic Research Associate — Codex /goal 6-hour autonomous run, autoresearch overnight | Codex 0.128 /goal [44], autoresearch on nanochat [30] | $50-500 / run | 6h-12h / run | A meaningful artifact emerges from 24h without human intervention | L2 wiki + 1-3 L3 tools + sandbox repo + AGENTS.md/CLAUDE.md |
| L5 | Dry-lab AI Scientist — closed loop: hypothesis → simulation/computation → analysis → next hypothesis | Sakana v1/v2 [41], autoresearch, AAR (dry-lab arm) [5] | $15-$18,000 / run | hours to days | New hypothesis from simulation/computation reaches publishable quality after human review | L4 + reviewer agent + evaluation framework + simulation/computational substrate |
| L6 | Wet-lab/Robot AI Scientist — hypothesis → experimental protocol → robot/lab automation → result → next | Co-Scientist AML follow-up [21], Pilon RoboChem-Flex [49], historical anchor: King Adam 2009 [36] | $5k hardware + $100-10k / experiment | days to weeks | Wet-lab results flow as a closed loop under PI approval | L5 + automated-lab infrastructure + safety protocols + human PI approval gate |
3.3.2 Commentary on each level
L0 (one-shot summarization). Where most people stay when they first meet ChatGPT. Paste a PDF, get one summary. The limit is obvious: the next question needs the same context provided again. The ChatGPT seed §6 summarized this in one line as "no long-term memory, low research reproducibility" [14].
L1 (Research Assistant). The level where one starts using the API and defines one workflow. "Find OAuth-related sections in these 100 papers" becomes possible. PaperQA2's free-tier use is one example [38]. The limit is no knowledge accumulation — asking the same question on the same 100 papers requires paying the same cost again.
L2 (LLM Wiki). The level Ch4-Ch6 of this survey treats. Ingest the 100 papers once and read the wiki pages thereafter — no repeated cost for the same question. The Karpathy gist + 6 OSS are all implementations of this level [33]. The largest pitfall is wiki rot — only n=1+1 longitudinal literature exists, which is the survey's G1 [4]. Ch6 proposes schema-level countermeasures prescriptively.
L3 (Paper-to-Agent). The level Stanford Paper2Agent defined [59]. Pick 1-3 key papers and package their algorithms, metrics, dataset loaders, and benchmarks as MCP tools. "Apply this paper's methodology to our data" becomes one prompt. Per the survey's G7, this conversion works best when the paper already ships stable code — all three Stanford cases (AlphaGenome, ScanPy, TISSUE) had well-tested bioinformatics code. For the majority of papers without that, LLM Wiki page + executable notebook remains the default lighter pattern.
L4 (Agentic Research Associate). The level Codex 0.128's /goal workflow defined [44]. Six-hour autonomous runs, sandbox repositories, artifacts without human intervention. The Tecton & Tide Six-Hour /goal Run That Survived a Five-Hour Pause field report is the first real-world evidence the 0.128 persistence guarantees hold [60]. Karpathy's autoresearch is the same level specialized to an ML training loop [30]. The G12 disambiguation belongs here — autoresearch's engineering application (Shopify Liquid 53%) and research application (nanochat 11%) share the same code pattern but solve epistemically different tasks. L4 covers both applications; the exit to L5 happens only when the research-flavored hypothesis generation reaches verifiable quality.
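The "same code pattern" shared by the engineering and research applications can be pictured as a keep-if-better loop: propose a change, measure it, and keep it only if the metric improves. The sketch below is a toy illustration under that assumption and is not Karpathy's actual script. Every name in it (`keep_if_better`, `toy_loss`, `toy_mutate`) is hypothetical, and the "mutation" is a random nudge to a single number rather than an agent-written code edit.

```python
import random
from typing import Callable, Tuple


def keep_if_better(
    initial: float,
    evaluate: Callable[[float], float],
    mutate: Callable[[float, random.Random], float],
    trials: int,
    seed: int = 0,
) -> Tuple[float, float, int]:
    """Greedy loop: accept a mutation only when it lowers the evaluated score."""
    rng = random.Random(seed)
    best, best_score = initial, evaluate(initial)
    kept = 0
    for _ in range(trials):
        candidate = mutate(best, rng)
        score = evaluate(candidate)
        if score < best_score:  # lower is better, as with a validation loss
            best, best_score = candidate, score
            kept += 1
    return best, best_score, kept


def toy_loss(x: float) -> float:
    """Stand-in metric; a real run would train and evaluate a model here."""
    return (x - 3.0) ** 2


def toy_mutate(x: float, rng: random.Random) -> float:
    """Stand-in mutation; a real run would have an agent edit code or config."""
    return x + rng.uniform(-0.5, 0.5)


best, best_score, kept = keep_if_better(0.0, toy_loss, toy_mutate, trials=110)
print(f"kept {kept} of 110 mutations, final loss {best_score:.4f}")
```

In this picture the loop stays the same across both applications; what changes is `evaluate`. A wall-clock benchmark makes it an engineering task, while a metric meant to test a hypothesis makes it a research task, which is where the G12 distinction lives.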
L5 (Dry-lab AI Scientist). Sakana v1/v2, AAR (dry-lab arm), and the in-silico arm of Co-Scientist all sit here [41]. The cost range is the widest of any level: Deep Researcher Agent (DRA) at $0.08/cycle, Sakana v1 at $15/paper, AAR at $18,000/run. The G3 caveat to emphasize at paragraph level lives here: AAR's PGR 0.97 was achieved on Claude Opus 4.6 in a research setting; on production-scale Claude Sonnet 4, no statistically significant gain was observed [5]. That the peak result of L5 is not production-transferable is itself the honest signal of the field.
L6 (Wet-lab/Robot AI Scientist). The top rung of the ladder, and simultaneously the level with the thinnest primary-source coverage (G9). Seventeen years of lineage since King's 2009 Adam [36], but in 2026 the LLM-driven L6 cases are basically RoboChem-Flex (chemistry, ~$5k modular hardware) [49], the AML follow-up to Co-Scientist + Guan et al.'s wet-lab replication [21], and Brazil 2026's Nature feature on the self-driving lab landscape [11]. This survey honestly frames L6 as a "preview tier whose case-study density is not comparable to L5" — what the tutorial reader (Ch10-12) can start today on a laptop is L2-L4, not L6. But the 17-year lineage itself underwrites L6's legitimacy — L6 is not a 2026 surprise but the most recent reincarnation of a long program that began with a 2009 existence proof.
3.3.3 The asymmetry of the ladder — why L2 is the thickest and L6 the thinnest
Reading the table quantitatively reveals one fact. This survey's timeline density and the per-level primary-source density of the ladder align exactly. L2 (LLM Wiki) is where more than 80% of the corpus concentrates (the 2026-04+ trifecta month + 6 OSS); L5 (Dry-lab AI Scientist) comes next (Sakana v1/v2 + AAR + autoresearch + Co-Scientist in silico); L3 (Paper-to-Agent) is one strand (Stanford Paper2Agent); L6 (Wet-lab AI Scientist) is the thinnest, supported by RoboChem-Flex + AML follow-up + the 17-year Adam lineage. The ladder's thickness reflects the field's maturity reality — L2 has entered mass-adoption while L6 remains preview tier.
This is a descriptive claim, not a normative one. Levels with thin coverage are not less valuable; they are simply less populated so far. L6, thin as it is, is also the region the survey expects to thicken fastest after the 2026-05 horizon: that 2026-04+ is the densest part of the timeline is itself the strongest evidence that the next year's publications will concentrate in L4-L6.
3.4 G8 — metrics cannot be compared on one axis, and that fact is the maturity signal
With the timeline and the ladder in place, one honest paragraph remains. The metrics across all the systems above (Sakana v1, v2, Co-Scientist, AI-Researcher, AAR, autoresearch, Deep Researcher Agent) cannot be placed on a single axis. Sakana v1 reports simulated-reviewer score + cost-per-paper; v2 reports ICLR workshop acceptance (n=1, qualitative reviewer comments); Co-Scientist reports 78.4% GPQA Diamond top-1 + expert preference + Elo + wet-lab IC50; AAR reports Performance Gap Recovered (PGR); AI-Researcher reports a hierarchical evaluation (full-spec vs sketch); autoresearch reports Time-to-GPT-2 reduction + keep-worthy improvement count; DRA reports $/24h cycle + cycle count.
Each metric is reasonable on its own. None of them is shared. GPQA measures graduate-level multiple-choice, not hypothesis quality. PGR is specialized to weak-to-strong supervision setups. ICLR workshop acceptance is n=1 and venue-specific. The simulated reviewer belongs to the system's own LLM family and, as Schmidgall et al. showed [53], does not reliably assess the novelty of its own output. Elo is self-referential. Time-to-GPT-2 is tied to nanochat as one substrate.
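The claim that no metric is shared can be stated as a set computation. The sketch below encodes the metrics listed above as labelled sets and looks for overlap. The label strings are paraphrases chosen here, so two cost figures with different units (per paper vs per 24h cycle) are deliberately kept as distinct labels.

```python
from itertools import combinations

# Metrics per system, paraphrased from the paragraph above.
reported_metrics = {
    "Sakana v1": {"simulated-reviewer score", "cost per paper"},
    "Sakana v2": {"ICLR workshop acceptance"},
    "Co-Scientist": {"GPQA Diamond top-1", "expert preference", "Elo", "wet-lab IC50"},
    "AAR": {"PGR"},
    "AI-Researcher": {"hierarchical evaluation"},
    "autoresearch": {"Time-to-GPT-2 reduction", "keep-worthy improvement count"},
    "DRA": {"cost per 24h cycle", "cycle count"},
}

# Metrics reported by every system.
shared_by_all = set.intersection(*reported_metrics.values())

# Pairs of systems that report at least one metric in common.
comparable_pairs = [
    (a, b)
    for a, b in combinations(reported_metrics, 2)
    if reported_metrics[a] & reported_metrics[b]
]

print(f"metrics shared by all systems: {shared_by_all or 'none'}")
print(f"system pairs with any shared metric: {comparable_pairs or 'none'}")
```

Under this encoding not even a single pair of systems shares a metric, which is a stronger statement than "no metric is common to all of them".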
This survey treats that fact not as a mere caveat but as the maturity signal of the field. The absence of head-to-head benchmarks across systems is itself the signal that the field has not yet entered its systemic comparison phase. This is G8. Popular-press attempts to line up "Sakana vs Co-Scientist vs AAR" treat metric-incomparable axes as if they sat in the same picture, and this survey refuses that comparison. Ch7 organizes systems by genealogy rather than benchmark ranking, and the six-level ladder marks evolutionary stages rather than benchmark grades.
How metric comparability emerged in adjacent fields puts this field's position in context. CV passed through ImageNet (2010s) before reaching shared metrics; NLP refined metrics along GLUE (2018) → SuperGLUE (2019) → MMLU (2020) → BIG-Bench (2022) → GPQA (2023) [23]. For the AI Scientist family to reach the same maturity will take time, and one finding of this survey at the 2026-05 cutoff is that this time has not yet come.
This framing also applies, as a matter of self-imposed honesty, to the six-level ladder itself. Each rung marks an evolutionary stage, not a benchmark grade. "I am at L4" means "the pattern I am using is one rung more autonomous than L3," not "I am better than an L3 system." This survey offers no head-to-head comparisons inside the same rung; for example, it does not rank Codex /goal against Claude Code subagents on user experience. When a community-shared benchmark emerges after this survey's cutoff, ranking work can begin then, and the generalization of Schmidgall et al.'s honest-evaluation discipline [53] will be its first step.
3.5 How to read this chapter — Self-assess → L_current+1 → Part II / III / IV
The final task of this chapter is helping the reader self-assess on the ladder and pick the next step. The following three questions are the starting point.
- Have you, in the past month, provided the same context to an LLM twice on the same topic? Yes → you are at L0/L1, and the first task is L2 entry — saving the context once into a wiki. Go to Part II (Ch4-Ch6).
- Do you maintain a wiki or Markdown vault? Yes → you are at L2 or above. Next: does the LLM maintain it, or do you update it by hand? If the LLM maintains it, you are at L2; the next rung is L3 — turning one key paper into an MCP tool. Go to Part III (Ch7-Ch9).
- Have you ever completed a 6-hour autonomous run with Codex or Claude Code? Yes → you are at L4 or above. L5 entry — self-assess whether the run's output was a new hypothesis or verification of an existing one, and move to hypothesis-flavored work — is the next step. Part IV (Ch10-12) is a tutorial, but for L4+ readers it functions as an environment checklist.
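The three questions can be read as a small decision procedure. The sketch below is one possible encoding, not a definitive one. It assumes that a hand-maintained vault counts as pre-L2, since L2 is defined in §3.3.1 as an agent-maintained store, and it treats the first question (repeating context) as the symptom of the default L0/L1 branch rather than as a separate input. The names `Assessment` and `self_assess` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assessment:
    current: str    # rung the reader is on now
    next_rung: str  # L_current+1
    go_to: str      # part of the survey to read next


def self_assess(has_wiki: bool, llm_maintains_wiki: bool, completed_6h_run: bool) -> Assessment:
    """Map answers to the questions above onto a rung, checking the highest rung first."""
    if completed_6h_run:
        return Assessment("L4+", "L5", "Part IV (Ch10-12), used as an environment checklist")
    if has_wiki and llm_maintains_wiki:
        return Assessment("L2", "L3", "Part III (Ch7-Ch9)")
    # Assumption: no vault, or a vault updated only by hand, is treated as pre-L2.
    return Assessment("L0/L1", "L2", "Part II (Ch4-Ch6)")


reader = self_assess(has_wiki=True, llm_maintains_wiki=True, completed_6h_run=False)
print(f"You are at {reader.current}; next rung {reader.next_rung}; go to {reader.go_to}")
```

Checking the 6-hour-run question first reflects the ladder's ordering: a reader who has completed such a run is placed at L4 or above regardless of how the earlier questions were answered.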
The same questions can be asked at the institutional level. Where is our team? In our field (life sciences / chemistry / materials / clinical / ML itself), which level holds the most autonomous case? Ch9 of this survey provides domain-specific answers — Sakana/AAR/autoresearch in ML itself (L5), Co-Scientist's AML follow-up and Guan's liver-fibrosis replication in life sciences (the L5-L6 boundary), RoboChem-Flex in chemistry (L6's thin tier), SciAgents in materials (between L4 and L5). The primary intended use of this survey is therefore in two steps — locate yourself on Ch3, locate your domain's best case in Ch9, and the gap is your next year of work.
In one sentence: the timeline shows how 81 years of parts came together; the ladder shows where you are now and what the next rung is. Where the two axes meet, the survey's real work begins — building your own environment. From Ch4 the thickest stage, L2 LLM Wiki, begins in earnest.
References
- Adam, D. (2026). The AI co-scientist is here. Nature Medicine, 2026-03-16.
- Agentic Researcher (2026). The Agentic Researcher: A Practical Guide to AI-Assisted Research. arXiv:2603.15914.
- AI Critique (2026). Enterprise Knowledge — The LLM Wiki Threat to Notion AI, Atlassian Rovo, Glean. Substack, 2026-05-08.
- Aimaker (2026). 4-Month Obsidian + LLM Wiki Longitudinal Report. Aimaker blog.
- Anthropic (2026). Automated Alignment Researchers — Using LLMs to scale scalable oversight. Anthropic Research, 2026-04-14.
- Astorian, L. (2026). lucasastorian/llmwiki — MCP-based LLM Wiki service. GitHub.
- Astro-Han (2026). Astro-Han/karpathy-llm-wiki — Agent-Skills package. GitHub.
- Boiko, D. A., MacKnight, R., & Gomes, G. (2023). Emergent autonomous scientific research capabilities of large language models. arXiv:2304.05332; Nature, 2023.
- Bowman, S. R. et al. (2022). Measuring Progress on Scalable Oversight for Large Language Models. arXiv:2211.03540.
- Bran, A. M. et al. (2023). ChemCrow: Augmenting large-language models with chemistry tools. arXiv:2304.05376; Nature Machine Intelligence, 2024.
- Brazil, R. (2026). Inside the self-driving lab revolution. Nature, 2026-03-30.
- Burns, C. et al. (2023). Weak-to-Strong Generalization. arXiv:2312.09390; ICML 2024.
- Bush, V. (1945). As We May Think. The Atlantic, 1945-07.
- ChatGPT seed (2026). LLM Wiki → AI Scientist preliminary research synthesis (private session capture, 2026-05-22). Internal seed used by deep-researcher / critical-analyst / book-writer.
- Chamin, 0x (2026). Mcptube — YouTube-to-LLM-Wiki converter. GitHub.
- Chen, W. et al. (2023). AgentVerse: Facilitating Multi-Agent Collaboration. arXiv:2308.10848.
- Clark, A., & Chalmers, D. J. (1998). The Extended Mind. Analysis, 58(1), 7-19.
- Clark, J. (2026). Import AI 454 — Reading AAR carefully. Substack, 2026-04-20.
- Ekadetov (2026). ekadetov/llm-wiki — Obsidian plugin for Claude Code. GitHub.
- Ghafarollahi, A., & Buehler, M. J. (2024). SciAgents: Automating Scientific Discovery through Multi-Agent Intelligent Graph Reasoning. arXiv:2409.05556.
- Gottweis, J. et al. (2025). Towards an AI co-scientist. arXiv:2502.18864.
- Guan, J. et al. (2026). AI-Assisted Drug Re-Purposing for Human Liver Fibrosis. Advanced Science.
- Hendrycks, D. et al. (2020). Measuring Massive Multitask Language Understanding. arXiv:2009.03300; ICLR 2021.
- HKUDS (2025). AI-Researcher: Autonomous Scientific Innovation. arXiv:2505.18705; NeurIPS 2025.
- Hong, S. et al. (2023). MetaGPT: Meta Programming for Multi-Agent Collaborative Framework. arXiv:2308.00352.
- HN (2026). LLM Wiki front-page thread (item 47640875). Hacker News, 2026-04-04.
- Izacard, G. et al. (2022). Atlas: Few-shot Learning with Retrieval Augmented Language Models. arXiv:2208.03299; JMLR 2023.
- Jumper, J. et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596, 583-589.
- Karpathy, A. (2017). Software 2.0. Medium.
- Karpathy, A. (2026a). karpathy/autoresearch. GitHub.
- Karpathy, A. (2026b). Autoresearch first overnight run tweet. Twitter/X, 2026-03-07.
- Karpathy, A. (2026c). Autoresearch Round 1 tweet (700 / 2 days / 20 keep-worthy / 11% Time-to-GPT-2). Twitter/X, ~2026-03-09.
- Karpathy, A. (2026d). LLM Wiki gist (karpathy/442a6bf555914893e9891c11519de94f). GitHub Gist, 2026-04-04.
- Karpathy, A. (2026e). LLM Wiki launch tweet. Twitter/X, 2026-04-04.
- Karpathy, A. (2026f). Farzapedia follow-up thread (Explicit / Yours / Files-over-apps / BYOAI). Twitter/X, 2026-04-12.
- King, R. D. et al. (2009). The Automation of Science. Science, 324, 85-89.
- Langley, P. (1981). Data-Driven Discovery of Physical Laws. Cognitive Science, 5(1).
- Lála, J., White, A. D. et al. (2024). PaperQA2: Faster, better, free research agents. arXiv:2409.13740.
- Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401; NeurIPS 2020.
- Li, G. et al. (2023). CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society. arXiv:2303.17760; NeurIPS 2023.
- Lu, C. et al. (2024). The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery. arXiv:2408.06292.
- Madaan, A. et al. (2023). Self-Refine: Iterative Refinement with Self-Feedback. arXiv:2303.17651; NeurIPS 2023.
- Nature News (2026). How to build an AI scientist: first peer-reviewed paper spills the secrets. Nature.
- OpenAI (2026a). Codex CLI 0.128.0 changelog — persisted /goal workflows + app-server APIs + worktrees. OpenAI Developers.
- OpenAI (2026b). AGENTS.md specification update. GitHub.
- Packer, C. et al. (2023). MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560; COLM 2024.
- Park, J. (2026). GeekNews — LLM Wiki Korean summary (BYOAI, files-over-apps). GeekNews.
- Park, J. S. et al. (2023). Generative Agents: Interactive Simulacra of Human Behavior. arXiv:2304.03442; UIST 2023.
- Pilon, T. et al. (2026). RoboChem-Flex: A ~$5,000 modular self-driving laboratory. Nature Synthesis.
- Rein, D. et al. (2023). GPQA: A Graduate-Level Google-Proof Q&A Benchmark. arXiv:2311.12022; COLM 2024.
- Schick, T. et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. arXiv:2302.04761; NeurIPS 2023.
- Schmidt, M., & Lipson, H. (2009). Distilling Free-Form Natural Laws from Experimental Data. Science, 324(5923), 81-85.
- Schmidgall et al. (2025). Evaluating Sakana's AI Scientist for Autonomous Research. arXiv:2502.14297.
- Shen, Y. et al. (2023). HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face. arXiv:2303.17580; NeurIPS 2023.
- Shinn, N. et al. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv:2303.11366; NeurIPS 2023.
- Silver, D. et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529, 484-489.
- skyllwt (2026). skyllwt/OmegaWiki — full-lifecycle AI research platform. GitHub.
- Srivastava, A. et al. (2022). Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models (BIG-Bench). arXiv:2206.04615.
- Stanford team (2025). Paper2Agent: Reimagining Research Papers As Interactive and Reliable AI Agents. arXiv:2509.06917.
- Tecton & Tide (2026). /goal: The Six-Hour Codex Run That Survived a Five-Hour Pause. Blog, 2026-05-01.
- The New Stack (2026). Andrej Karpathy's 630-line Python script ran 50 experiments overnight without any human. The New Stack, 2026-03.
- Um, T. (2025). Conductor — LLM Orchestration Patterns. terryum.ai, 2025-12.
- Um, T. (2026). Brain Augmentation; Democratization of Research; AAR synthesis; autoresearch synthesis; AI Co-Scientist synthesis. terryum.ai post collection, 2026-03 to 2026-05.
- ussumant (2026). ussumant/llm-wiki-compiler — Claude Code plugin. GitHub.
- Wang, G. et al. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv:2305.16291; TMLR 2024.
- Wei, J. et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903; NeurIPS 2022.
- Willison, S. (2026). Codex 0.128 — Persisted /goal Walkthrough. Personal blog, 2026-04-30.
- Yu, W. (2026). What Is Karpathy's LLM Wiki? A Zettelkasten User's Honest Review. Personal blog, 2026-04-20.
- Wu, H. et al. (2026). Towards a Medical AI Scientist. arXiv:2603.28589.
- Wu, Q. et al. (2023). AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. arXiv:2308.08155.
- Yamada, Y. et al. (2025). The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search. arXiv:2504.08066.
- Yao, S. et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629; ICLR 2023.
- Yao, S. et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv:2305.10601; NeurIPS 2023.
- Zhang, X. (2026). Deep Researcher Agent: An Autonomous Framework for 24/7 Deep Learning Experimentation with Zero-Cost Monitoring. arXiv:2604.05854.