Part I: Why and What's Different — A Paradigm Shift in Research Methodology

Chapter 2: The Four-Layer Taxonomy — LLM Wiki / Paper-to-Agent / Research Associate / AI Scientist

Written: 2026-05-22
Last updated: 2026-05-22

2.1 Four things inside one word

April 2026 scrambled the vocabulary. "AI Scientist," "LLM Wiki," "AI agent," "research associate," "autonomous researcher," "co-scientist": inside a single tweet thread these terms substitute for one another. Consider five primary sources: Karpathy's gist [14], Anthropic's AAR report [4], Stanford's Paper2Agent paper [27], Google's AI Co-Scientist [9], and Sakana v1 [20]. That these five do not sit comfortably under a single label is itself the reason a taxonomy is needed.

The claim of this chapter is simple. Inside what gets called "AI Scientist" there are four separable layers. Each layer does a different job, produces a different artifact, and was named by a different primary source. Build one well, and the next can sit on top of it; skip one, and the next always breaks.

| Code | Name | Role | Representative system (May 2026) |
| --- | --- | --- | --- |
| L_wiki | LLM Wiki | Knowledge compiler / external memory | Karpathy gist + 6 OSS implementations [14] |
| L_p2a | Paper-to-Agent | Papers → callable tools | Stanford Paper2Agent [27] |
| L_assoc | Agentic Research Associate | Executes coding, analysis, lit-review, reporting | Codex 0.128 /goal, Claude Code [5] |
| L_scientist | AI Scientist | Hypothesis → experiment → analysis → review closed loop | Sakana v1/v2, Co-Scientist, AAR, autoresearch [20] |

This frame is scattered across the primary literature; no single source covers all four. Karpathy's gist covers L_wiki only [14]. Stanford's Paper2Agent covers L_p2a only [27]. The Agentic Researcher five-level autonomy taxonomy treats the autonomy axis inside L_assoc only [1]. Sakana, Co-Scientist, and AAR all cover L_scientist only.

The survey's primary intellectual contribution, then, is the assembly of these four layers into one frame. It is integration, not invention. The four-row table in ChatGPT seed §1 [30] was the starting point; this chapter aligns it with the primary literature.

2.2 The four definitions — one paragraph each

L_wiki — LLM Wiki: the knowledge compiler

LLM Wiki is the pattern of compiling a researcher's external cortex onto files. Raw sources are preserved (raw/); on top of them an agent maintains human-readable Markdown pages (concepts, claims, contradictions, open questions, source links) that accumulate over time. The output is not a database but a file hierarchy. Karpathy's April 4, 2026 gist [14] is the canonical naming. The 16M+ views on the launch tweet [15], plus six open-source implementations within a week, are the distributed evidence that this layer is not a proposal but a working pattern: Astro-Han's Agent-Skills package, lucasastorian's MCP-hosted service, ussumant's "compiler" framing, ekadetov's Obsidian plugin, OmegaWiki's 23-skill full-lifecycle build [26], and Mcptube's YouTube converter. The decisive distinction is with RAG. RAG is query-time retrieval [19]; LLM Wiki is ingest-time and maintenance-time synthesis [14]. One retrieves; the other writes. Ch4–Ch6 cover this layer only.
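The ingest-time versus query-time distinction can be made concrete with a small sketch. It is illustrative only: `summarize` is a hypothetical stand-in for an LLM synthesis call, and the page layout (`## Claim`, `## Source`) is an assumption, not something the gist prescribes. Only the `raw/` and `wiki/concepts/` directory names come from this survey.

```python
from pathlib import Path


def summarize(text: str) -> str:
    """Hypothetical stand-in for an LLM synthesis call: keep the first sentence."""
    return text.split(".")[0].strip() + "."


def ingest(vault: Path, name: str, text: str) -> Path:
    """Ingest-time synthesis: preserve the raw source, then write a Markdown page."""
    raw_dir = vault / "raw"
    page_dir = vault / "wiki" / "concepts"
    raw_dir.mkdir(parents=True, exist_ok=True)
    page_dir.mkdir(parents=True, exist_ok=True)
    (raw_dir / f"{name}.txt").write_text(text, encoding="utf-8")
    page = page_dir / f"{name}.md"
    page.write_text(
        f"# {name}\n\n## Claim\n{summarize(text)}\n\n## Source\nraw/{name}.txt\n",
        encoding="utf-8",
    )
    return page


def retrieve(vault: Path, query: str) -> list[str]:
    """Query-time retrieval (RAG-style): return matching raw text, persist nothing."""
    hits = []
    for raw in sorted((vault / "raw").glob("*.txt")):
        text = raw.read_text(encoding="utf-8")
        if query.lower() in text.lower():
            hits.append(text)
    return hits
```

In this sketch `ingest` leaves a new file behind each time it runs, while `retrieve` leaves the vault unchanged. That asymmetry is the "one retrieves; the other writes" distinction in miniature.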

L_p2a — Paper-to-Agent: conversion into callable capability

Paper-to-Agent is the pattern of converting a paper itself into an executable tool. Where LLM Wiki reads and organizes papers, Paper-to-Agent wraps the algorithms, metrics, dataset loaders, simulations, benchmarks, and experimental protocols inside a paper into MCP tools that programs can call. The output is not Markdown but executable code and an API on top of it. Stanford's Paper2Agent paper (arXiv:2509.06917) [27] is the canonical reference, with three case studies — AlphaGenome (genomics), ScanPy (single-cell analysis), TISSUE (spatial transcriptomics) — establishing that papers with well-maintained code can be auto-converted into MCP tools [3]. The reason this layer matters is that synthesizing and calling are different jobs. LLM Wiki answers "what does this paper claim?"; Paper-to-Agent answers "run this paper's method on our data." The interfaces differ. Ch8 covers this layer.
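A minimal sketch of what "callable" means here, under stated assumptions: the registry, the `register` decorator, and `call_tool` are hypothetical names invented for illustration, and the code deliberately does not use the real MCP SDK or reproduce Paper2Agent's pipeline. The toy `gc_content` function stands in for a method extracted from a paper.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Tool:
    """A paper method exposed with a name, a description, and typed parameters."""
    name: str
    description: str
    params: dict[str, type]
    fn: Callable[..., Any]


REGISTRY: dict[str, Tool] = {}


def register(name: str, description: str, params: dict[str, type]):
    """Decorator that records a function in the (hypothetical) tool registry."""
    def wrap(fn: Callable[..., Any]) -> Callable[..., Any]:
        REGISTRY[name] = Tool(name, description, params, fn)
        return fn
    return wrap


@register(
    name="gc_content",
    description="Fraction of G/C bases in a DNA sequence (toy stand-in for a paper's method).",
    params={"sequence": str},
)
def gc_content(sequence: str) -> float:
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must be non-empty")
    return sum(base in "GC" for base in seq) / len(seq)


def call_tool(name: str, arguments: dict[str, Any]) -> Any:
    """Look a tool up by name, check argument types, then run it."""
    tool = REGISTRY[name]
    for key, expected in tool.params.items():
        if not isinstance(arguments.get(key), expected):
            raise TypeError(f"{name}: argument {key!r} must be {expected.__name__}")
    return tool.fn(**arguments)
```

A caller needs only the tool name and an argument dictionary, for example `call_tool("gc_content", {"sequence": "GGCA"})`. A wiki page describing the same method could not be invoked this way, which is the interface difference the paragraph above draws.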

L_assoc — Agentic Research Associate: the research execution harness

Agentic Research Associate is the pattern of autonomously executing code, analysis, documents, and experiment protocols on top of human-defined research questions. The output is neither Markdown nor a tool but traces of executed work: TODO.md updates, report.md drafts, accumulated results/, dry-run logs, post-mortems of failed experiments. In 2026, this layer's substrate is two CLI coding agents, Claude Code [5] and Codex CLI [21]. Codex 0.128's /goal is the representative primitive. The Tecton & Tide six-hour autonomous run [28], five hours of which passed without a human present, is the most legible field-grade evidence that this layer works, and Simon Willison's 0.128 review [32] formalized the primitive for an English-speaking audience. The Agentic Researcher group's five-level autonomy taxonomy [1] is the taxonomic anchor for this layer; Ch3 extends it into a six-level model. This survey is itself a worked example of this layer; Ch11 covers it in detail.

L_scientist — AI Scientist: closed-loop autonomous discovery

AI Scientist is the pattern of running hypothesis generation → experiment design → execution → result analysis → next-experiment selection as a closed loop inside a human approval gate. The output is discovery itself: new hypotheses, in silico re-derived mechanisms, wet-lab–confirmed candidate drugs. Sakana v1 (August 2024) [20] is the canonical naming. Though limited to ML, it was the first system to bundle ideation → code → experiment → visualization → paper → simulated review end-to-end. After it came Sakana v2 [33], Google AI Co-Scientist [9], HKUDS AI-Researcher [10], Anthropic AAR [4], Karpathy autoresearch [16], and Zhang's Deep Researcher Agent [34], each carrying the pattern into a different domain: alignment, biomedical hypothesis generation, ML training optimization, clinical decision-making. Schmidgall et al.'s critical evaluation [24] and AAR's Sonnet-4 transfer failure [4] situate this layer not as a finished system but as a young field whose verification infrastructure is still catching up. Ch7–Ch9 cover this layer, with Ch8 devoted to L_p2a.

2.3 The four layers side by side

| Dimension | L_wiki | L_p2a | L_assoc | L_scientist |
| --- | --- | --- | --- | --- |
| Core activity | Synthesis | Conversion | Execution | Discovery |
| Primary artifact | Markdown wiki vault | MCP tool / callable module | Code, report.md, experiment logs | New hypotheses, in silico results, papers |
| Time unit | Ingest-time + maintenance-time | One conversion per paper | Minutes-to-hours per task | Hours-to-days per cycle |
| Human role | Curator, reviewer | Domain expert, tool user | PM, code reviewer | PI, approval gate |
| Naming date | 2026-04-04 [14] | 2025-09 [27] | 2026-04 [1] | 2024-08 [20] |
| Verification infrastructure | None (Ch5 G2) | Three Paper2Agent cases | Five-level taxonomy + field reports | Schmidgall critique + 1 workshop paper + wet-lab validation |
| Human-AI interface | Curation + direct editing | API call | CLI + slash command | Approval gate + co-reasoning |
| Cost (representative) | Operator time + model API | One model run per conversion | $0.50–$5 per task | $0.08/cycle [34] to $18k/run [4] |
| Most common misreading | "Another name for RAG" | "A subset of LLM Wiki" | "A weaker AI Scientist" | "One giant prompt" |

This table is the chapter's spine. The four columns do different work — synthesis, conversion, execution, discovery. They produce different artifacts, operate on different time scales, and have different verification infrastructure. And most importantly: each sits under a different kind of misreading. That last row is the cleanest argument for why the taxonomy is needed.

2.4 Why four and not three — versus existing attempts

The most non-obvious decision in the taxonomy is breaking out Paper-to-Agent as its own layer. Contrast with four nearby attempts.

Agentpedia's "three-layer OS" [2] cuts the same space into LLM Wiki + Agentic Researcher + AI Scientist. Its weakness is straightforward. Stanford's Paper2Agent was published in September 2025 [27]; when the Agentpedia piece appeared in April 2026, Paper2Agent had already been covered by InfoQ [11] and HPCwire [3]. Agentpedia nevertheless absorbs Paper-to-Agent into LLM Wiki — treating it as a kind of "paper-organizing" inside the wiki. The arrangement collapses on the question: which tool calls AlphaGenome? Not the wiki — wikis yield Markdown pages, not callable APIs.

Um's three-stage democratization [29] (documentary / in silico / physical) is an orthogonal cut: a maturity axis, not a layer axis. The documentary stage can contain L_wiki + L_p2a + L_assoc; in silico contains L_assoc + dry-lab L_scientist; physical contains wet-lab L_scientist. This survey separates the axes: Ch2's four layers are what is being done; Ch3's six levels are how autonomously.

Claude-to-Codex Part IV [30] is the survey's predecessor. Ch10 covered LLM Wiki, Ch11 the personal worked example, Ch12 the AI Scientist cases, but it did not name a four-layer taxonomy. That book's frame was "Claude Code → Codex migration," and the layers were subordinate to it. This survey changes the frame — not a migration story but a map of the field itself. So the layers must be named explicitly.

Agentic Researcher's five-level autonomy [1] is yet another axis. It slices L0 (full human control) → L4 (high agent autonomy) on the autonomy dimension. Where does it sit in our taxonomy? Inside L_assoc. The Agentic Researcher taxonomy refines the internal autonomy of L_assoc; it does not cover L_wiki, L_p2a, or L_scientist.

The four prior attempts each cut the same space along a different axis. The result is clear: a formal layer-axis taxonomy is missing. This chapter fills the gap.

A second reason for breaking out Paper-to-Agent shows up in the boundary cases. Where would one classify the result of wrapping AlphaGenome as an MCP tool? Calling it L_wiki means a page about using AlphaGenome ends up in the wiki, but AlphaGenome itself does not become callable. Calling it L_scientist means a discovery loop uses AlphaGenome, but the loop is not creating AlphaGenome. Calling it L_assoc means analysis uses AlphaGenome, but creating the tool is not analysis. That all three of these answers feel awkward forces a fourth: Paper-to-Agent is its own layer [27].

2.5 Data flow between layers

The four layers are rarely isolated in practice. Real workflows are flows between layers.

L_wiki → L_assoc: LLM Wiki serves as the long-term memory of the research associate. Claude Code reading CLAUDE.md at each session start [5] and Codex following the AGENTS.md spec [21] are the protocols of this flow. wiki/concepts/ pages enter the prompt context; the agent's results flow back into wiki/log.md and wiki/claims/. The wiki is not read-only — it is a shared workspace the agent updates alongside the human.
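The read and write-back halves of this flow can be sketched as follows. This is a minimal illustration, not how Claude Code or Codex load memory files: `build_context`, `log_result`, the character budget, and the log line format are all assumptions. Only the `wiki/concepts/` and `wiki/log.md` paths come from the text above.

```python
from datetime import date
from pathlib import Path


def build_context(vault: Path, topics: list[str], budget_chars: int = 4000) -> str:
    """Concatenate wiki/concepts pages in the given order until the budget is spent.

    Missing pages are skipped; the first page that would exceed the budget stops
    the scan, so topic order acts as priority.
    """
    parts: list[str] = []
    used = 0
    for topic in topics:
        page = vault / "wiki" / "concepts" / f"{topic}.md"
        if not page.exists():
            continue
        text = page.read_text(encoding="utf-8")
        if used + len(text) > budget_chars:
            break
        parts.append(text)
        used += len(text)
    return "\n\n".join(parts)


def log_result(vault: Path, summary: str, when: date) -> None:
    """Write-back half of the flow: append one dated line to wiki/log.md."""
    log = vault / "wiki" / "log.md"
    log.parent.mkdir(parents=True, exist_ok=True)
    with log.open("a", encoding="utf-8") as fh:
        fh.write(f"- {when.isoformat()}: {summary}\n")
```

Appending rather than overwriting keeps the log usable as a shared record when the human and the agent both write to it.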

L_wiki → L_p2a: a claim page in the wiki ("this paper defines algorithm X") becomes a task ("make this algorithm callable"). Paper2Agent's automated conversion pipeline [27] is the automated form of this flow.

L_p2a → L_scientist: the most interesting flow, and the source of the boundary ambiguity Ch8 names G7. When an AI Scientist system decides at the verification stage to "evaluate this hypothesis using AlphaGenome," the call itself is a use of a Paper-to-Agent output. Co-Scientist's GPQA-diamond evaluation [9], SciAgents' KG-based reasoning [8], and PaperQA2's literature synthesis [18] all depend on tools extracted from papers. L_scientist's autonomy scales with the richness of L_p2a.

L_assoc → L_scientist: the code, analyses, and failed experiments accumulated by the research associate become priors for the AI Scientist. Karpathy's autoresearch [16] runs on top of the nanochat repository, which Karpathy authored in research-associate mode. AAR's nine Opus 4.6 instances work on a shared memory called forum-scratch, which is the accumulated artifact of L_wiki + L_assoc [4].

L_scientist → L_wiki: the closing edge of the closed loop. New hypotheses, results, and failure analyses produced by the AI Scientist flow back into the wiki as priors for the next cycle. Voyager's skill library [31], MemGPT's hierarchical memory [22], and Generative Agents' memory stream [23] are the academic ancestors of this flow.

| Flow | Interface | Naming date |
| --- | --- | --- |
| L_wiki → L_assoc | CLAUDE.md / AGENTS.md | 2024-12 / 2026-04 [5] |
| L_wiki → L_p2a | Paper2Agent auto-extraction | 2025-09 [27] |
| L_p2a → L_scientist | MCP tool call | 2024-11 (MCP spec) |
| L_assoc → L_scientist | Shared repo / forum-scratch | 2026-04 [4] |
| L_scientist → L_wiki | Claim page update | 2023 (Voyager ancestor) [31] |

When all five flows run simultaneously, the survey calls the result a "research OS": a bundle in which the human poses a hypothesis, the AI Scientist proposes experiments, the research associate writes code, Paper-to-Agent tools are called, and the LLM Wiki accumulates results. Ch12 lays out the five-step roadmap for constructing this bundle.
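The five flows form a small directed graph, and the "closing edge" claim can be checked mechanically. The sketch below encodes the flows from this section as edges; the function names are illustrative, not part of any system described here.

```python
# Edges are the five flows named in this section; values are their interfaces.
FLOWS = {
    ("L_wiki", "L_assoc"): "CLAUDE.md / AGENTS.md",
    ("L_wiki", "L_p2a"): "Paper2Agent auto-extraction",
    ("L_p2a", "L_scientist"): "MCP tool call",
    ("L_assoc", "L_scientist"): "Shared repo / forum-scratch",
    ("L_scientist", "L_wiki"): "Claim page update",
}


def successors(flows: dict, node: str) -> list[str]:
    """Layers that receive a flow from `node`, in declaration order."""
    return [dst for (src, dst) in flows if src == node]


def reachable(flows: dict, start: str) -> set[str]:
    """All layers reachable from `start`, including `start` itself."""
    seen = {start}
    stack = [start]
    while stack:
        for nxt in successors(flows, stack.pop()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def cycles_from(flows: dict, start: str) -> list[list[str]]:
    """Every simple cycle that leaves `start` and returns to it."""
    found: list[list[str]] = []

    def walk(node: str, path: list[str]) -> None:
        for nxt in successors(flows, node):
            if nxt == start:
                found.append(path + [start])
            elif nxt not in path:
                walk(nxt, path + [nxt])

    walk(start, [start])
    return found
```

With all five edges present, two cycles pass through L_wiki: one via L_assoc and one via L_p2a, both returning through L_scientist. Removing the L_scientist → L_wiki edge leaves every layer still reachable from the wiki but removes both cycles, which is one way to read "closing edge."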

2.6 The boundaries are not crisp

A taxonomy is a tool, not a truth. This section names where the four-layer cut is not clean.

Boundary 1: a paper page in the wiki vs. Paper-to-Agent. If a wiki page covers a paper exhaustively — algorithm pseudocode, I/O spec, dataset links, reproduction environment — how does it differ from a Paper-to-Agent output? The survey's answer is callability. A wiki page is read by humans; a Paper-to-Agent output is called by programs. The boundary blurs gradually as code snippets embedded in wiki pages grow richer. Ch8 §8.6 (gap G7) treats this in detail.

Boundary 2: Research Associate vs. AI Scientist. Was Codex /goal's six-hour autonomous run [28] L_assoc or L_scientist? The survey's answer is whether the discovery loop is closed. /goal autonomously executes a human-defined task — the discovery loop is not closed. Karpathy's autoresearch [16] runs on the same substrate (a coding agent) but its loop — hypothesis → mutation → evaluation → next mutation — is closed. That both layers are accessible on the same tool is the interesting fact. The layer is determined not by tooling but by whether the loop is closed.
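The open-loop versus closed-loop distinction can be shown with a toy example. This is a sketch under heavy simplification: the quadratic `evaluate` function stands in for a real experiment, and the random-mutation hill climb is an illustrative stand-in, not the algorithm used by autoresearch or by Codex /goal.

```python
import random


def evaluate(x: float) -> float:
    """Toy objective standing in for an experiment; lower is better, minimum at x = 3."""
    return (x - 3.0) ** 2


def run_task(x: float) -> float:
    """Open loop: execute one human-defined task and report the result."""
    return evaluate(x)


def closed_loop(start: float, steps: int, seed: int = 0) -> tuple[float, float]:
    """Closed loop: each evaluation decides which candidate the next mutation starts from."""
    rng = random.Random(seed)
    best_x, best_score = start, evaluate(start)
    for _ in range(steps):
        candidate = best_x + rng.uniform(-1.0, 1.0)  # mutation of the current best
        score = evaluate(candidate)                  # evaluation
        if score < best_score:                       # selection feeds the next mutation
            best_x, best_score = candidate, score
    return best_x, best_score
```

Both functions call the same `evaluate`, just as both layers can run on the same coding agent. What differs is whether the result of one step selects the input of the next without a human in between.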

Boundary 3: AI Scientist vs. domain-specific ML system. Is AlphaFold [13] L_scientist? Is AlphaGo [25]? The answer is whether a generalizable autonomous discovery loop exists. AlphaFold is sealed inside one domain — humans designed the discovery stages; AlphaFold performs one stage (prediction). L_scientist autonomously selects the discovery stages themselves. The boundary blurs over time. King et al.'s Adam (2009) [17] autonomously generated yeast-gene hypotheses and verified them in a wet lab — domain-specific, but its discovery loop was closed. Whether Adam is L_scientist is an open question revisited in Ch7–Ch9.

Boundary 4: LLM Wiki vs. strong RAG. Atlas [12] and PaperQA2 [18] are sophisticated retrieval-augmented LMs. How do they differ from LLM Wiki? The survey's answer is the form and authorship of the knowledge. RAG retrieves chunks but does not store the synthesized result. LLM Wiki produces synthesis as the output, persisted as human-readable Markdown files. The difference is one of kind, not degree — but Denser's vendor counter-take [7] argues "strong RAG" can do the same work. Ch4 treats this debate honestly.
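The four boundary tests above can be read together as a rough decision procedure. The sketch below is one possible encoding: the field names and the order in which the tests are applied are assumptions made for illustration, and real systems near a boundary will not reduce to four booleans.

```python
from dataclasses import dataclass


@dataclass
class System:
    name: str
    persists_synthesis: bool      # writes human-readable pages that outlive the query
    callable_by_programs: bool    # exposes a paper's method as an API
    executes_tasks: bool          # autonomously carries out human-defined work
    closes_discovery_loop: bool   # selects its own next experiment from results


def classify(system: System) -> str:
    """Apply the boundary tests from the most to the least autonomous layer."""
    if system.closes_discovery_loop:
        return "L_scientist"
    if system.executes_tasks:
        return "L_assoc"
    if system.callable_by_programs:
        return "L_p2a"
    if system.persists_synthesis:
        return "L_wiki"
    return "unclassified"


# Illustrative inputs only; the flag values restate this section's own judgments.
goal_run = System("Codex /goal run", False, False, True, False)
autoresearch = System("autoresearch", False, False, True, True)
wrapped_tool = System("AlphaGenome as MCP tool", False, True, False, False)
plain_rag = System("query-time RAG", False, False, False, False)
```

Under these flags, `classify(goal_run)` gives "L_assoc" and `classify(autoresearch)` gives "L_scientist" even though both set `executes_tasks`, which matches the point that the layer follows loop closure rather than tooling.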

The survey does not hide the blur as a weakness. The taxonomy is not 100% clean; it is 90% clean; and it names where the remaining 10% blur lives. That is the most a taxonomy-as-tool can claim.

Figure 2.3: The taxonomy is a tool — the four layers can be decomposed and then composed back into domain-specific research workflows — illustration by author (gpt-image assisted)

2.7 Onward to Ch3 — timeline and six-level maturity

The four-layer frame is a spatial taxonomy — what is being done. Ch3 adds a temporal-developmental taxonomy — within the same layer, systems can be ranked on autonomy across six levels, L0 (humans control every step) → L6 (autonomous wet-lab). It adopts the Agentic Researcher five-level taxonomy [1] and extends it with L0 (the LLM-use baseline) plus the L5/L6 split (dry-lab vs. wet-lab).

Ch3 also walks the 21 months from Sakana v1 (August 2024) to May 2026 in chronological order, locating when and where each of this chapter's four layers was named. The four-layer × six-level grid is that chapter's output.

To close: the four layers are L_wiki — knowledge synthesis (Karpathy gist + 6 OSS), L_p2a — conversion into tools (Stanford Paper2Agent), L_assoc — code/analysis execution (Codex /goal + Claude Code), L_scientist — closed-loop discovery (Sakana → Co-Scientist → AAR → autoresearch). That these can be separated is one contribution of the survey; that having separated them they can be composed is the second. Ch4–Ch6 cover L_wiki only, Ch7–Ch9 cover L_p2a + L_scientist, and Ch10–Ch12 cover composition and operation of all four.

Figure 2.1: Four-layer card diagram — L_wiki (knowledge synthesis), L_p2a (conversion to tools), L_assoc (execution), L_scientist (closed-loop discovery). Each card shows core activity, representative system, naming date — illustration by author (gpt-image assisted)
Figure 2.2: Inter-layer data flows — five arrows with their protocol labels (CLAUDE.md, AGENTS.md, MCP, forum-scratch, claim-update) — illustration by author (gpt-image assisted)

References

  1. Agentic Researcher, "The Agentic Researcher: A Practical Guide to AI-Assisted Research," arXiv:2603.15914, 2026. [Agentic Researcher, 2026]
  2. Agentpedia, "Karpathy's LLM Wiki: The Complete Guide to His Idea File," Agentpedia, 2026. [Agentpedia, 2026]
  3. AIwire, "Stanford's Paper2Agent Reimagines Scientific Papers as Interactive AI Agents," HPCwire AIwire, 2025-10-10. [AIwire, 2025]
  4. Anthropic, "Automated Alignment Researchers — Using LLMs to scale scalable oversight," Anthropic Research, 2026-04-14. [Anthropic, 2026]
  5. Anthropic, "Claude Code memory + subagent documentation," Anthropic Docs, 2026. [Anthropic, 2026]
  6. Clark, Jack, "Import AI 454: Automating alignment research," Import AI, 2026-04-20. [Clark, 2026]
  7. Denser.ai, "From RAG to LLM Wiki: What Karpathy's idea means for AI knowledge bases," Denser.ai Blog, 2026. [Denser, 2026]
  8. Ghafarollahi, Alireza et al. (2024). SciAgents: Automating scientific discovery through multi-agent intelligent graph reasoning. arXiv:2409.05556. [Ghafarollahi et al., 2024]
  9. Gottweis, Juraj et al. (2025). Towards an AI co-scientist (Google AI Co-Scientist). arXiv:2502.18864. [Gottweis et al., 2025]
  10. HKUDS (2025). AI-Researcher: Autonomous Scientific Innovation. arXiv:2505.18705. [HKUDS, 2025]
  11. InfoQ, "Paper2Agent Converts Scientific Papers into Interactive AI Agents," InfoQ, 2025-10. [InfoQ, 2025]
  12. Izacard, Gautier et al. (2022). Atlas: Few-shot Learning with Retrieval Augmented Language Models. arXiv:2208.03299. [Izacard et al., 2022]
  13. Jumper, John et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature 596: 583-589. [Jumper et al., 2021]
  14. Karpathy, Andrej, "LLM Wiki — A pattern for building personal knowledge bases using LLMs," GitHub Gist, 2026-04-04. [Karpathy, 2026]
  15. Karpathy, Andrej, "LLM Wiki announcement (Twitter/X thread)," Twitter/X, 2026-04-04. [Karpathy, 2026]
  16. Karpathy, Andrej, "karpathy/autoresearch — AI agents running research on single-GPU nanochat training," GitHub, 2026-03-07. [Karpathy, 2026]
  17. King, Ross D. et al. (2009). The Automation of Science. Science 324: 85-89. [King et al., 2009]
  18. Lala, J. et al. (2024). PaperQA2 — Language agents achieve superhuman synthesis of scientific knowledge. arXiv:2409.13740. [Lala et al., 2024]
  19. Lewis, Patrick et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. [Lewis et al., 2020]
  20. Lu, Chris et al. (2024). The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery. arXiv:2408.06292. [Lu et al., 2024]
  21. OpenAI, "Codex /goal Command," Ralphable, 2026. [OpenAI, 2026]
  22. Packer, Charles et al. (2023). MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560. [Packer et al., 2023]
  23. Park, Joon Sung et al. (2023). Generative Agents: Interactive Simulacra of Human Behavior. UIST 2023. [Park et al., 2023]
  24. Schmidgall et al. (2025). Evaluating Sakana's AI Scientist for Autonomous Research. arXiv:2502.14297. [Schmidgall et al., 2025]
  25. Silver, David et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature 529: 484-489. [Silver et al., 2016]
  26. skyllwt, "OmegaWiki — Wiki-centric full-lifecycle AI research platform on Claude Code (DAIR Lab, Peking University)," GitHub, 2026-04. [skyllwt, 2026]
  27. Stanford team (2025). Paper2Agent: Reimagining Research Papers As Interactive and Reliable AI Agents. arXiv:2509.06917. [Stanford, 2025]
  28. Tecton & Tide, "/goal: The Six-Hour Codex Run That Survived a Five-Hour Pause," Tecton & Tide Blog, 2026-04. [Tecton & Tide, 2026]
  29. Um, Taewoong, "Democratization of research — three stages (document → in silico → physical)," terryum.ai, 2026-04-15. [Um, 2026]
  30. Um, Taewoong, "Claude Code → Codex 이관 전략" [Claude Code → Codex Migration Strategy], terryum.ai, 2026-04-24. [Um, 2026]
  31. Wang, Guanzhi et al. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models. TMLR 2024. [Wang et al., 2023]
  32. Willison, Simon, "Codex CLI 0.128.0 adds /goal," Simon Willison's Blog, 2026-04-30. [Willison, 2026]
  33. Yamada, Yutaro et al. (2025). The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search. arXiv:2504.08066. [Yamada et al., 2025]
  34. Zhang, Xiangyue (2026). Deep Researcher Agent: An Autonomous Framework for 24/7 Deep Learning Experimentation. arXiv:2604.05854. [Zhang, 2026]