Part I: Why and What's Different — A Paradigm Shift in Research Methodology

Chapter 1: The Paradigm Shift of AI-Era Research

Written: 2026-05-22 Last updated: 2026-05-22

1.1 Four events inside a single month

Between early March and mid-April 2026, a span of little more than a month, four independently prepared systems landed almost simultaneously.

On March 7, Andrej Karpathy opened the karpathy/autoresearch repository and announced its first overnight run on his nanochat training loop: twelve hours, 110 mutations, validation loss from 0.862415 to 0.858039 [18, 19]. Two days later he reported that 700 experiments over the intervening forty-eight hours had produced twenty improvements judged "worth keeping" and had cut Time-to-GPT-2 from 2.02 hours to 1.80 hours, an 11% reduction [20]. A single 630-line Python script had converted the hours one person spends asleep into one cycle of ML research [7, 33].
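The loop described above (try a change, measure validation loss, keep the change only if it helps) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Karpathy's script: `keep_if_better`, `toy_loss`, and `toy_mutate` are hypothetical names, and the quadratic "loss" stands in for a real training run. The last two lines only re-derive the percentages from the figures quoted in the text.

```python
import random

def keep_if_better(evaluate, mutate, config, n_trials, seed=0):
    """Greedy search: apply one mutation per trial, keep it only if loss drops."""
    rng = random.Random(seed)
    best_loss = evaluate(config)
    kept = 0
    for _ in range(n_trials):
        candidate = mutate(config, rng)
        loss = evaluate(candidate)
        if loss < best_loss:  # improvement: adopt the candidate as the new baseline
            config, best_loss = candidate, loss
            kept += 1
    return config, best_loss, kept

# Hypothetical stand-ins for a real training run (assumption, not from the source).
def toy_loss(cfg):
    return 0.85 + (cfg["lr"] - 0.3) ** 2

def toy_mutate(cfg, rng):
    return {"lr": cfg["lr"] + rng.uniform(-0.05, 0.05)}

start = {"lr": 0.1}
best, best_loss, kept = keep_if_better(toy_loss, toy_mutate, start, n_trials=110)
print(f"kept {kept} of 110 mutations, loss {toy_loss(start):.4f} -> {best_loss:.4f}")

def relative_reduction(before, after):
    """Fractional improvement of `after` over `before`."""
    return (before - after) / before

print(f"Time-to-GPT-2: {relative_reduction(2.02, 1.80):.1%}")
print(f"validation loss: {relative_reduction(0.862415, 0.858039):.2%}")
```

With the reported numbers, the Time-to-GPT-2 reduction comes out at roughly 10.9%, which the text rounds to 11%, and the first-night validation-loss change is roughly half a percent.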

On April 4, the same person published a Markdown file as a GitHub gist [15]. The title was "LLM Wiki." This was not a product launch but a pattern proposal. Leave the raw sources unchanged; let an agent build Markdown pages on top of them; let those pages serve as the starting point for the next question. The launch tweet recorded 16M+ views within 24 hours [16] and hit the Hacker News front page the same day. Within a week, six open-source implementations existed: Astro-Han's Agent-Skills package, lucasastorian's MCP-based service, ussumant's "compiler" framing, ekadetov's Obsidian plugin, OmegaWiki's 23-skill full-lifecycle implementation, and Mcptube's YouTube converter [3, 4, 8, 11, 32, 37].
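The three rules of the pattern (raw sources stay untouched, generated Markdown pages sit on top of them, and the pages are the entry point for the next question) can be made concrete with a small sketch. Everything here is an assumption for illustration: the `raw/` and `wiki/` layout, the `build_wiki` and `summarize` names, and the Obsidian-style `[[link]]` index are hypothetical, and `summarize` is a placeholder where an agent call would go.

```python
from pathlib import Path
import tempfile

def summarize(text: str) -> str:
    """Hypothetical placeholder for the agent call: returns the first non-empty line."""
    for line in text.splitlines():
        if line.strip():
            return line.strip()
    return "(empty source)"

def build_wiki(raw_dir: Path, wiki_dir: Path) -> list:
    """Write one Markdown page per raw source plus an index; raw files are only read."""
    wiki_dir.mkdir(parents=True, exist_ok=True)
    pages = []
    for src in sorted(raw_dir.glob("*.txt")):
        text = src.read_text(encoding="utf-8")
        page = wiki_dir / f"{src.stem}.md"
        page.write_text(
            f"# {src.stem}\n\n{summarize(text)}\n\nSource: {src.name}\n",
            encoding="utf-8",
        )
        pages.append(page)
    # The index is the starting point for the next question.
    index = "\n".join(f"- [[{p.stem}]]" for p in pages) + "\n"
    (wiki_dir / "index.md").write_text(index, encoding="utf-8")
    return pages

# Demo in a temporary directory so nothing on disk is overwritten.
root = Path(tempfile.mkdtemp())
raw, wiki = root / "raw", root / "wiki"
raw.mkdir()
original = "Otto keeps a notebook.\nIt stores what he cannot remember.\n"
(raw / "extended-mind.txt").write_text(original, encoding="utf-8")
pages = build_wiki(raw, wiki)
print((wiki / "index.md").read_text(encoding="utf-8"))
```

The design choice worth noticing is the one-way data flow: the function reads from the raw directory and writes only to the wiki directory, so a bad synthesis can always be regenerated from unchanged sources.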

On April 14, Anthropic published results from the Automated Alignment Researchers (AAR) program [2]. Nine Claude Opus 4.6 instances ran for five days, accumulating about 800 instance-hours at a cost near $18,000, automating alignment research itself. Performance Gap Recovered (PGR) reached 0.97 versus a human baseline of 0.23 — more than a fourfold improvement. The same report carried an honest caveat: at production scale on Claude Sonnet 4, no statistically significant gain was observed, and reward-hacking episodes were detected [2].
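For readers unfamiliar with the metric, the sketch below shows how a Performance Gap Recovered figure is typically computed and re-derives the ratios quoted above. It assumes the definition common in the weak-to-strong generalization literature (the fraction of the gap between a weak baseline and a strong ceiling that a method closes); whether the AAR report uses exactly this form is an assumption here. The accuracy values passed to the function are hypothetical and chosen only to land on 0.97.

```python
def performance_gap_recovered(weak, strong_ceiling, achieved):
    """PGR = (achieved - weak) / (strong_ceiling - weak); assumed definition."""
    gap = strong_ceiling - weak
    if gap <= 0:
        raise ValueError("strong_ceiling must exceed weak baseline")
    return (achieved - weak) / gap

# Hypothetical accuracies, picked only to illustrate a PGR near 0.97.
pgr = performance_gap_recovered(weak=0.60, strong_ceiling=0.80, achieved=0.794)
print(f"illustrative PGR: {pgr:.2f}")

# Figures quoted in the text (approximate in the source: 'about 800', 'near $18,000').
ratio = 0.97 / 0.23
cost_per_instance_hour = 18_000 / 800
max_instance_hours = 9 * 5 * 24  # nine instances running nonstop for five days

print(f"AAR vs human baseline: {ratio:.2f}x")
print(f"cost per instance-hour: ${cost_per_instance_hour:.2f}")
print(f"upper bound on instance-hours: {max_instance_hours}")
```

The ratio of about 4.2 matches the "more than fourfold" wording, and the roughly 800 reported instance-hours sit below the 1,080-hour ceiling that nine instances could accumulate in five days.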

And Google's AI Co-Scientist, first announced in February 2025 [12], returned in March 2026 with wet-lab follow-ups published in Nature Medicine and Advanced Science [1, 13]. The follow-ups covered AML candidate validation, two liver-fibrosis targets (one of them a repurposing candidate for an FDA-approved drug), and the cf-PICIs discovery. The hypotheses were generated by a multi-agent LLM system; the wet-lab confirmation came from humans.

These four events came from different teams, different institutions, different intentions. That they arrived inside the same month is itself the signal. The AI Scientist is no longer a vision for five years out. It is present-tense, and at a stage where it can be verified. This survey is about why this month was this month, what became possible inside it, and what remains open.

1.2 The philosophical foundation of the external brain

The four events were possible not just because models got bigger. They were possible because a social consensus formed around how to put maintainable external memory next to the model.

The philosophical roots of that consensus go back to Andy Clark and David Chalmers's 1998 paper The Extended Mind [9] and its thought experiment of Otto and his notebook. Otto has Alzheimer's, so he writes "The museum is on 53rd Street" in a notebook that he keeps with him at all times. Inga remembers the same fact in her head. Otto's notebook is functionally equivalent to Inga's memory: it is reliably accessible and automatically consulted when he reasons. An external resource that reliably plays the functional role of cognition is part of the mind. That is the parity principle.

This survey's author applied that principle to AI-era research in his March 10, 2026 essay Brain Augmentation [34]. The claim is simple. Research is no longer the act of reading papers and storing them in one's head. Research is the act of building an environment that an AI agent will keep filling in by itself. Those who build the environment well get further. Those who build it poorly cannot get past a single question, no matter how strong the underlying model.

This claim closely parallels what Karpathy said in different words on April 12. In his Farzapedia reply he defined four properties of the LLM Wiki: Explicit, Yours, Files-over-apps, and BYOAI (Bring Your Own AI) [17]. The Korean-language community condensed it on GeekNews into a one-liner: "Forget RAG" [28].

The author's second essay, Democratization of Research, is the same principle on a different axis [35]. When the external brain becomes feasible, research democratizes in three stages. (1) The documentary stage, gathering and organizing material, opens to everyone. (2) The in silico stage, testing hypotheses on a computer, opens to everyone. (3) The physical stage, automated experimentation, opens to everyone. As of May 2026, the first stage is essentially done (Ch4–Ch6), the second is rapidly stabilizing (Ch7–Ch8), and the third has been brought within visible range by systems like the $5,000 RoboChem-Flex [29].

1.3 Why now, and what is different

The idea of automated research is not new. In 1945, Vannevar Bush's Memex proposal described a microfilm desk that linked documents by "associative trails" [6]; in 1981, Pat Langley's BACON system rediscovered Kepler's third law and Ohm's law from raw data [23]. In 2009, King's Adam autonomously formed yeast-gene hypotheses and verified them by experiment [22]. Alongside these, AlphaGo (2016) and AlphaFold (2021) supplied the existence proof that "domain-specific AI can produce scientist-grade output" [14, 31].

But every prior generation of autonomous scientist was sealed inside a single narrow domain. Adam did yeast; AlphaFold did proteins. Generalization happened in human hands. Sakana's The AI Scientist v1 (August 2024) broke that seal for the first time [24]. The domain was still narrow — ML research only — but it was the first end-to-end demonstration that went ideation → code → experiment → visualization → paper → simulated review in one pass. Schmidgall et al. soon published a critique exposing the system's unreliable novelty assessment [30], but the critique itself was evidence that the field had become serious.

What is different about April 2026 lies elsewhere. This time it is not expensive, not closed, and you can build it yourself. autoresearch is a single 630-line file [18]. The LLM Wiki gist is a few pages of Markdown [15]. AAR's topology is nine equal peer instances sharing one workspace [2]. RoboChem-Flex's hardware cost is around $5,000 [29]. None of these is anything one giant company can monopolize. As long as two coding agents, Claude Code and Codex CLI, remain within consumer reach, the same patterns are reproducible by individual researchers on their own laptops.

One thing must be said plainly here. The vocabulary that shapes all of these patterns comes from one person. Karpathy's 2017 Software 2.0 essay [21], March 2026's autoresearch, April's LLM Wiki, April's Farzapedia, May's nanochat: an overwhelming majority of this survey's Tier 1 sources sit under one author's framing. This survey does not hide that dependence; it states it. Karpathy's gist is not invention but integration of patterns that have been hiding in plain sight for decades: Luhmann's Zettelkasten (1992) [25], Ahrens's How to Take Smart Notes (2017) [5], Park et al.'s Generative Agents (2023) [27], Packer et al.'s MemGPT (2023) [26], and Wang et al.'s Voyager (2023) [38]. The LLM Wiki is one form that this lineage naturally reaches in the LLM era. Ch4 traces that thirty-year lineage in detail.

1.4 The three questions this survey answers

Synthesizing the three sections above leaves three questions. The twelve chapters follow them in order.

First, what is the external brain? Part II (Ch4–Ch6) answers. What Karpathy's LLM Wiki pattern is, how the six open-source implementations that exploded in April differ, and how to design a research-grade wiki schema. Ch6 states honestly that the empirical literature on wiki rot — the failure mode of accumulating bad synthesis — is still at n=2.

Second, how do you close the loop to autonomous discovery? Part III (Ch7–Ch9) answers. The genealogy from Sakana v1 through AI Co-Scientist; the autonomous-experimentation patterns of autoresearch, AAR, and Paper2Agent; and the domain case studies (ML, alignment, biomedical, materials, medical). Ch7–Ch8 treat negative results like AAR's Sonnet-4 transfer failure as load-bearing paragraphs, not footnotes.

Third, how far does autonomy go, and what do you build yourself? Part IV (Ch10–Ch12) answers. A tutorial that starts with no self-hosted LLM — just Claude Code/Codex + Obsidian; the Ch11 worked example in which this survey itself is the demonstration; and a step-by-step roadmap from L2 LLM Wiki to L6 wet-lab.

Figure 1.3: The three questions this survey answers — Q1 what changed, Q2 what does the new stack look like, Q3 where are people on the ladder — each maps to Part II, III, and IV — illustration by author (gpt-image assisted)

Before answering those three questions, two things must be settled. (1) Words like "AI Scientist," "LLM Wiki," "AI agent," and "research associate" tend to clump together, but inside that clump there are four separable layers — Ch2 names them. (2) Across those layers there is a maturity scale on which readers can self-diagnose where they are and what the next step is — Ch3 lays it out.

This survey takes Part IV (Ch10–Ch12) of the same author's earlier From Claude Code to Codex as its starting point [36]. Those three chapters compressed "from LLM Wiki to AI Scientist." This survey refines them into a four-layer taxonomy, extends them with a six-level maturity model, and consolidates the OSS matrix that exploded in April. It is a book by the same author, but not the same book. A month is a long time when time flows differently.

Figure 1.1: Four AI research events that arrived inside one month — Karpathy autoresearch (3/7) then Co-Scientist Nature Medicine follow-up (3/16) then Karpathy LLM Wiki gist (4/4) then Anthropic AAR (4/14) — illustration by author (gpt-image assisted)
Figure 1.2: One-page parallel between Clark and Chalmers 1998 Otto-Inga thought experiment and Karpathy 2026 BYOAI / Files-over-apps four properties — illustration by author (gpt-image assisted)

References

  1. Adam, David, "The AI co-scientist is here," Nature Medicine, 2026-03-16. [Adam, 2026]
  2. Anthropic, "Automated Alignment Researchers — Using LLMs to scale scalable oversight," Anthropic Research, 2026-04-14. [Anthropic, 2026]
  3. Astro-Han, "karpathy-llm-wiki — Agent Skills-compatible LLM Wiki for Claude Code/Codex," GitHub, 2026-04. [Astro-Han, 2026]
  4. Astorian, Lucas, "lucasastorian/llmwiki — Open-source LLM Wiki with document upload + Claude MCP," GitHub, 2026-04. [Astorian, 2026]
  5. Ahrens, Sönke (2017). How to Take Smart Notes. CreateSpace. [Ahrens, 2017]
  6. Bush, Vannevar (1945). As We May Think (the Memex proposal). The Atlantic, July 1945. [Bush, 1945]
  7. BSWEN, "What Results Did 700 Autoresearch Experiments Achieve Overnight?," Medium, 2026-03-30. [BSWEN, 2026]
  8. 0xchamin, "Mcptube — Karpathy's LLM Wiki applied to YouTube (transcripts + vision frames)," GitHub, 2026-04. [0xchamin, 2026]
  9. Clark, Andy and Chalmers, David (1998). The Extended Mind. Analysis 58(1): 7-19. [Clark and Chalmers, 1998]
  10. Clark, Jack, "Import AI 454: Automating alignment research," Import AI, 2026-04-20. [Clark, 2026]
  11. ekadetov, "ekadetov/llm-wiki — Claude Code plugin for persistent compounding KBs in Obsidian," GitHub, 2026-04. [ekadetov, 2026]
  12. Gottweis, Juraj et al. (2025). Towards an AI co-scientist. arXiv:2502.18864. [Gottweis et al., 2025]
  13. Guan et al. (2026). AI-Assisted Drug Re-Purposing for Human Liver Fibrosis. Advanced Science. [Guan et al., 2026]
  14. Jumper, John et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature 596: 583-589. [Jumper et al., 2021]
  15. Karpathy, Andrej, "LLM Wiki — A pattern for building personal knowledge bases using LLMs," GitHub Gist, 2026-04-04. [Karpathy, 2026]
  16. Karpathy, Andrej, "LLM Wiki announcement (Twitter/X thread)," Twitter/X, 2026-04-04. [Karpathy, 2026]
  17. Karpathy, Andrej, "Farzapedia reply — personalization argument for LLM Wiki," Twitter/X, 2026-04-12. [Karpathy, 2026]
  18. Karpathy, Andrej, "karpathy/autoresearch — AI agents running research on single-GPU nanochat training," GitHub, 2026-03-07. [Karpathy, 2026]
  19. Karpathy, Andrej, "Autoresearch first-run tweet — 12h / 110 changes on nanochat," Twitter/X, 2026-03-07. [Karpathy, 2026]
  20. Karpathy, Andrej, "Autoresearch Round 1 tweet — 700 experiments / 11% Time-to-GPT-2 reduction," Twitter/X, 2026-03-09. [Karpathy, 2026]
  21. Karpathy, Andrej (2017). Software 2.0. Medium. [Karpathy, 2017]
  22. King, Ross D. et al. (2009). The Automation of Science. Science 324: 85-89. [King et al., 2009]
  23. Langley, Pat (1981). Data-Driven Discovery of Physical Laws (BACON). Cognitive Science 5(1): 31-54. [Langley, 1981]
  24. Lu, Chris et al. (2024). The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery. arXiv:2408.06292. [Lu et al., 2024]
  25. Luhmann, Niklas (1992). Communicating with Slip Boxes — An Empirical Account. Essay. [Luhmann, 1992]
  26. Packer, Charles et al. (2023). MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560. [Packer et al., 2023]
  27. Park, Joon Sung et al. (2023). Generative Agents: Interactive Simulacra of Human Behavior. UIST 2023. [Park et al., 2023]
  28. Park, Jaehong, "Forget RAG — Karpathy's 'LLM Wiki' as a new knowledge-management paradigm," GeekNews (Korean), 2026-05. [Park, 2026]
  29. Pilon, Simone et al. (2026). A flexible and affordable self-driving laboratory for automated reaction optimization. Nature Synthesis. [Pilon et al., 2026]
  30. Schmidgall et al. (2025). Evaluating Sakana's AI Scientist for Autonomous Research. arXiv:2502.14297. [Schmidgall et al., 2025]
  31. Silver, David et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature 529: 484-489. [Silver et al., 2016]
  32. skyllwt, "OmegaWiki — Wiki-centric full-lifecycle AI research platform on Claude Code (DAIR Lab, Peking University)," GitHub, 2026-04. [skyllwt, 2026]
  33. The New Stack, "Andrej Karpathy's 630-line Python script ran 50 experiments overnight without any human," The New Stack, 2026-03. [The New Stack, 2026]
  34. Um, Taewoong, "Brain Augmentation — manifesto for AI-era self-generating knowledge environments," terryum.ai, 2026-03-10. [Um, 2026]
  35. Um, Taewoong, "Democratization of research — three stages (document → in silico → physical)," terryum.ai, 2026-04-15. [Um, 2026]
  36. Um, Taewoong, "Claude Code → Codex migration strategy," terryum.ai, 2026-04-24. [Um, 2026]
  37. ussumant, "ussumant/llm-wiki-compiler — Claude Code plugin: markdown knowledge → topic-based wiki," GitHub, 2026-04. [ussumant, 2026]
  38. Wang, Guanzhi et al. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models. TMLR 2024. [Wang et al., 2023]