Part I: Why and What's Different — A Paradigm Shift in Research Methodology

Chapter 2: The Four-Layer Taxonomy — LLM Wiki / Paper-to-Agent / Research Associate / AI Scientist

Written: 2026-05-22; last updated: 2026-05-22

2.1 Four things inside one word

April 2026 scrambled the vocabulary. "AI Scientist," "LLM Wiki," "AI agent," "research associate," "autonomous researcher," "co-scientist" — inside a single tweet thread these terms substitute for one another. Karpathy's gist ^[16], Anthropic's AAR report ^[5], Stanford's Paper2Agent paper ^[27], Google's AI Co-Scientist ^[9], Sakana v1 ^[20] — that these five primary sources do not sit comfortably under a single label is itself the reason a taxonomy is needed.

The claim of this chapter is simple. Inside what gets called "AI Scientist" there are four separable layers. Each layer does a different job, produces a different artifact, and was named by a different primary source. Build one well, and the next can sit on top of it; skip one, and the next always breaks.

Code	Name	Role	Representative system (May 2026)
L_wiki	LLM Wiki	Knowledge compiler / external memory	Karpathy gist + 6 OSS implementations ^[16]
L_p2a	Paper-to-Agent	Papers → callable tools	Stanford Paper2Agent ^[27]
L_assoc	Agentic Research Associate	Executes coding, analysis, lit-review, reporting	Codex CLI `/goal`, Claude Code ^[5]
L_scientist	AI Scientist	Hypothesis → experiment → analysis → review closed loop	Sakana v1/v2, Co-Scientist, AAR, autoresearch ^[20]

This frame is scattered across the primary literature — no single source covers all four. Karpathy's gist covers L_wiki only ^[16]. Stanford's Paper2Agent covers L_p2a only ^[27]. The Agentic Researcher five-level autonomy taxonomy treats the autonomy axis inside L_assoc only ^[1]. Sakana, Co-Scientist, and AAR all cover L_scientist only.

The survey's primary intellectual contribution, then, is the assembly of these four layers into one frame. It is integration, not invention. The four-row table in ChatGPT seed §1 ^[30] was the starting point; this chapter aligns it with the primary literature.

2.2 The four definitions — one paragraph each

L_wiki — LLM Wiki: the knowledge compiler

LLM Wiki is the pattern of compiling a researcher's external cortex onto files. Raw sources are preserved (raw/); an agent maintains, on top of them, human-readable Markdown pages — concepts, claims, contradictions, open questions, source links — that accumulate over time. The output is not a database but a file hierarchy. Karpathy's April 4, 2026 gist ^[16] is the canonical naming, and the 16M+ views on the launch tweet ^[16], plus six open-source implementations within a week — Astro-Han's Agent-Skills package, lucasastorian's MCP-hosted service, ussumant's "compiler" framing, ekadetov's Obsidian plugin, OmegaWiki's 23-skill full-lifecycle build ^[26], and Mcptube's YouTube converter — are the distributed evidence that this layer is not a proposal but a working pattern.

The decisive distinction is with RAG. RAG also ingests documents in advance, but that ingest step is mostly indexing for later retrieval. When a user asks a question, the system retrieves relevant chunks from a vector database and the LLM synthesizes an answer at query time ^[19]. LLM Wiki uses ingest differently: the agent reads the raw source and writes human-readable pages such as concepts/, claims/, contradictions/, and open_questions/. When new sources arrive later, the agent merges them into existing pages, removes duplication, flags conflicting claims, and fixes broken links or unsupported statements. That recurring upkeep is maintenance-time synthesis ^[16]. For example, if the question is "How does Sakana AI's AI Scientist connect to Karpathy's AutoResearch?", RAG searches for relevant chunks and interprets the connection in that moment. LLM Wiki leaves the connection behind in pages such as wiki/concepts/ai-scientist.md, wiki/comparisons/ai-scientist-vs-autoresearch.md, and wiki/open_questions/evaluation-of-agentic-science.md, so later questions can reuse the accumulated artifact. The basic verb of RAG is retrieve; the basic verb of LLM Wiki is write. Ch4–Ch6 cover this layer only.
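The difference between query-time retrieval and maintenance-time synthesis can be sketched in a few lines of Python. This is a toy illustration, not Karpathy's implementation and not the API of any RAG library. The function names (`rag_answer_context`, `wiki_ingest`), the word-overlap scoring, the sample strings, and the in-memory dict standing in for the wiki/ directory are all assumptions made for this example.

```python
from collections import defaultdict


def rag_answer_context(chunks: list[str], question: str, k: int = 2) -> list[str]:
    """Query-time path: rank stored chunks by word overlap and return the top k.

    Nothing is written back; the next question starts from the same chunks.
    """
    q_words = set(question.lower().split())
    ranked = sorted(
        chunks,
        key=lambda c: len(q_words & set(c.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def wiki_ingest(wiki: dict[str, list[str]], page: str, claim: str) -> None:
    """Maintenance-time path: merge a claim into a page, skipping duplicates."""
    if claim not in wiki[page]:
        wiki[page].append(claim)


# Illustrative chunks (hypothetical text, not quoted from any source).
chunks = [
    "AI Scientist bundles ideation, code, experiments and review",
    "autoresearch mutates a training script and evaluates the result",
    "RAG retrieves chunks from a vector database",
]
context = rag_answer_context(chunks, "How does AI Scientist relate to autoresearch?")

# The wiki keeps the synthesized connection as a durable page.
wiki: dict[str, list[str]] = defaultdict(list)
page = "wiki/comparisons/ai-scientist-vs-autoresearch.md"
wiki_ingest(wiki, page, "Both close a hypothesis-to-evaluation loop.")
wiki_ingest(wiki, page, "Both close a hypothesis-to-evaluation loop.")  # duplicate, ignored
wiki_ingest(wiki, page, "autoresearch targets single-GPU training runs.")
```

The retrieval function returns context and forgets it. The ingest function changes state that the next question can reuse, which is the sense in which the basic verb shifts from retrieve to write.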

L_p2a — Paper-to-Agent: conversion into callable capability

Paper-to-Agent is the pattern of converting a paper itself into an executable tool. Where LLM Wiki reads and organizes papers, Paper-to-Agent wraps the algorithms, metrics, dataset loaders, simulations, benchmarks, and experimental protocols inside a paper into MCP tools that programs can call. The output is not Markdown but executable code and an API on top of it. Stanford's Paper2Agent paper (arXiv:2509.06917) ^[27] is the canonical reference. Its three case studies, AlphaGenome (genomics), Scanpy (single-cell analysis), and TISSUE (spatial transcriptomics), establish that papers with well-maintained code can be auto-converted into MCP tools ^[3].

The distinction that matters here is with Papers2Code-style "paper with code" infrastructure. Papers2Code presents itself as an open-source community platform for discovering and collaborating on 500K+ paper implementations ^[34]. The main artifact of that layer is a link between a paper and an implementation repository. To use it in research, a human still has to find the code, install it, understand its I/O, and adapt it to local data. Paper2Agent moves one step downstream: it analyzes the paper and associated codebase, builds an MCP server, exposes algorithms and datasets as tools, and iteratively generates tests to harden the interface. So the claim is not "drop in any paper and a complete API appears." More precisely, when a paper has verifiable code, Paper2Agent upgrades paper-with-code into paper-as-tool. For papers without code, the burden of validating generated code or notebooks remains. This layer matters because synthesizing and calling are different jobs. LLM Wiki answers "what does this paper claim?"; Paper-to-Agent answers "run this paper's method on our data." The interfaces differ. Ch8 covers this layer.
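The step from paper-with-code to paper-as-tool can be illustrated with a small standard-library sketch. It is not Paper2Agent's code and does not use the MCP SDK. `ToolRegistry` is a hypothetical stand-in for an MCP server's tool table, `normalize_counts` is a placeholder for a method that would come from a paper's codebase, and `smoke_test` plays the role of a generated test that hardens the interface.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Tool:
    name: str
    description: str
    func: Callable[..., object]


@dataclass
class ToolRegistry:
    """Hypothetical stand-in for an MCP server's tool table."""

    tools: dict[str, Tool] = field(default_factory=dict)

    def register(self, name: str, description: str):
        def decorator(func):
            self.tools[name] = Tool(name, description, func)
            return func

        return decorator

    def call(self, name: str, **kwargs):
        # Programs call tools by name with keyword arguments.
        return self.tools[name].func(**kwargs)


registry = ToolRegistry()


@registry.register("normalize_counts", "Scale each row of counts to sum to 1.")
def normalize_counts(counts: list[list[float]]) -> list[list[float]]:
    # Placeholder method; a real conversion would wrap the paper's own code.
    out = []
    for row in counts:
        total = sum(row)
        out.append([x / total if total else 0.0 for x in row])
    return out


def smoke_test(reg: ToolRegistry) -> bool:
    """Check the tool on a known input, including an all-zero edge case."""
    result = reg.call("normalize_counts", counts=[[1.0, 3.0], [0.0, 0.0]])
    return result == [[0.25, 0.75], [0.0, 0.0]]
```

A wiki page could describe `normalize_counts` in prose; only the registry entry makes it something another program can invoke and test.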

L_assoc — Agentic Research Associate: the research execution harness

Agentic Research Associate is the pattern of taking a human-defined research goal and executing the coding, analysis, documents, and experiment protocols needed to move it forward. It is not yet an AI Scientist that invents a new research agenda inside a closed loop. The human gives a bounded goal such as "refresh the related-work table for this paper set," "run this ablation and summarize the result in report.md," or "read the failed experiment logs and propose the next fix." The agent plans inside the repository, edits files, runs scripts and tests, and leaves updates in TODO.md, report.md, results/, dry-run logs, and post-mortems. The output is not a single summary paragraph; it is the trace of work actually performed.
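The idea that the output is a trace of work, not a summary, can be sketched as a minimal harness. This is an assumption-laden toy, not how Claude Code or Codex CLI work internally. `run_goal` and the step functions are hypothetical names, and the steps are plain Python callables where a real agent would edit files and run scripts.

```python
import tempfile
from pathlib import Path
from typing import Callable


def run_goal(goal: str, steps: list[tuple[str, Callable[[], str]]], workdir: Path) -> Path:
    """Run each step for a human-defined goal and record the outcome in report.md."""
    report = workdir / "report.md"
    lines = [f"Goal: {goal}", ""]
    for name, step in steps:
        try:
            lines.append(f"- [ok] {name}: {step()}")
        except Exception as exc:
            # Record the failure and stop; later steps are not attempted.
            lines.append(f"- [failed] {name}: {exc}")
            break
    report.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return report


def failing_test() -> str:
    raise RuntimeError("1 test failed")


with tempfile.TemporaryDirectory() as tmp:
    path = run_goal(
        "run this ablation and summarize the result in report.md",
        [
            ("run ablation", lambda: "accuracy logged to results/"),
            ("run tests", failing_test),
            ("write summary", lambda: "never reached"),
        ],
        Path(tmp),
    )
    report_text = path.read_text(encoding="utf-8")
```

The report keeps the failed step alongside the successful one, which is what makes a later post-mortem possible.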

In 2026, this layer's representative substrate is two terminal-based coding agents: Claude Code ^[5] and Codex CLI ^[21]. For Codex, the relevant release is Codex CLI 0.128.0 (April 30, 2026), whose important feature was the /goal long-horizon command ^[32]. /goal lets a user state one larger objective and then lets the agent continue through many steps of editing, running, checking, and preserving state. The version number matters less than the behavior it exposed: a human defines the goal, and the agent works in a research repository for hours until it leaves a concrete artifact. The Tecton & Tide six-hour autonomous run, five hours of which passed without a human present, is the clearest field evidence that this pattern works in real use ^[28]. The Agentic Researcher group's five-level autonomy taxonomy ^[1] explains the autonomy axis inside this layer; Ch3 extends it into a six-level model. This survey is itself a worked example of the pattern, because its literature updates, drafts, builds, and review artifacts were accumulated in that mode; Ch11 covers it in detail.

L_scientist — AI Scientist: closed-loop autonomous discovery

AI Scientist is the pattern of running hypothesis generation → experiment design → execution → result analysis → next-experiment selection as a closed loop behind a human approval gate. The output is discovery itself: new hypotheses, mechanisms re-derived in silico, candidate drugs confirmed in the wet lab. Sakana v1 (August 2024) ^[20] is the canonical naming. Though limited to ML, it was the first system to bundle ideation → code → experiment → visualization → paper → simulated review end-to-end. After it came Sakana v2 ^[33], Google AI Co-Scientist ^[9], HKUDS AI-Researcher ^[10], Anthropic AAR ^[5], Karpathy autoresearch ^[16], and Zhang's Deep Researcher Agent ^[35], which carried the pattern into domains including alignment, biomedical hypothesis generation, ML training optimization, and clinical decision-making. Schmidgall et al.'s critical evaluation ^[24] and AAR's Sonnet-4 transfer failure ^[5] situate this layer not as a finished system but as a young field whose verification infrastructure is still catching up. Ch7–Ch9 cover this layer together with L_p2a.
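The loop structure described above can be sketched as follows. This is a toy under stated assumptions, not the architecture of Sakana's AI Scientist, Co-Scientist, AAR, or autoresearch. A random hill climb stands in for hypothesis generation, a quadratic function stands in for an experiment, and `approve` is a hypothetical callback standing in for the human approval gate.

```python
import random
from typing import Callable


def closed_loop(
    propose: Callable[[float], float],
    evaluate: Callable[[float], float],
    approve: Callable[[float], bool],
    start: float,
    cycles: int,
) -> tuple[float, float, list[dict]]:
    """Hypothesis -> approval gate -> experiment -> analysis -> next selection."""
    best, best_score = start, evaluate(start)
    log: list[dict] = []
    for cycle in range(cycles):
        candidate = propose(best)        # hypothesis generation
        if not approve(candidate):       # human approval gate
            log.append({"cycle": cycle, "candidate": candidate, "status": "rejected"})
            continue
        score = evaluate(candidate)      # experiment execution
        kept = score > best_score        # result analysis
        if kept:                         # next-experiment selection
            best, best_score = candidate, score
        log.append(
            {"cycle": cycle, "candidate": candidate, "status": "kept" if kept else "discarded"}
        )
    return best, best_score, log


rng = random.Random(0)
best, best_score, log = closed_loop(
    propose=lambda x: x + rng.uniform(-1.0, 1.0),
    evaluate=lambda x: -((x - 3.0) ** 2),   # toy objective with its optimum at x = 3
    approve=lambda x: 0.0 <= x <= 10.0,     # stand-in for a human reviewer
    start=0.5,
    cycles=50,
)
```

What marks this as a closed loop is that the next candidate depends on the analysis of the previous result, while the gate can still veto any single experiment before it runs.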

2.3 The four layers side by side

Dimension	L_wiki	L_p2a	L_assoc	L_scientist
Core activity	Synthesis	Conversion	Execution	Discovery
Primary artifact	Markdown wiki vault	MCP tool / callable module	Code, report.md, experiment logs	New hypotheses, in silico results, papers
Time unit	Ingest-time + maintenance-time	One conversion per paper	Minutes-to-hours per task	Hours-to-days per cycle
Human role	Curator, reviewer	Domain expert, tool user	PM, code reviewer	PI, approval gate
Naming date	2026-04-04 ^[16]	2025-09 ^[27]	2026-04 ^[1]	2024-08 ^[20]
Verification infrastructure	None (Ch5 G2)	Three Paper2Agent cases	Five-level taxonomy + field reports	Schmidgall critique + 1 workshop paper + wet-lab validation
Human-AI interface	Curation + direct editing	API call	CLI + slash command	Approval gate + co-reasoning
Cost (representative)	Operator time + model API	One model run per conversion	$0.50–$5 per task	$0.08/cycle ^[35] to $18k/run ^[5]
Most common misreading	"Another name for RAG"	"A subset of LLM Wiki"	"A weaker AI Scientist"	"One giant prompt"

This table is the chapter's spine. The four columns do different work — synthesis, conversion, execution, discovery. They produce different artifacts, operate on different time scales, and have different verification infrastructure. And most importantly: each sits under a different kind of misreading. That last row is the cleanest argument for why the taxonomy is needed.

Figure 2.1: Four-layer card diagram — L_wiki (knowledge synthesis), L_p2a (conversion to tools), L_assoc (execution), L_scientist (closed-loop discovery). Each card shows core activity, representative system, naming date — illustration by author (gpt-image assisted)

2.4 Why four and not three

The key choice in this taxonomy is separating Paper-to-Agent from LLM Wiki. The reason is simple. LLM Wiki reads and organizes papers. Paper-to-Agent turns a method inside a paper into a callable tool. Both start from papers, but their outputs are different. One is a page for humans to read; the other is an API for programs to call.

This is where nearby taxonomies blur. Agentpedia's three-layer OS cuts the space into LLM Wiki + Agentic Researcher + AI Scientist, absorbing Paper-to-Agent into the wiki layer ^[2]. But consider an AlphaGenome MCP wrapper. A wiki can contain a "how to use AlphaGenome" page; it does not make AlphaGenome itself callable. Making the tool is neither summarization nor experimentation. That is why the layer needs its own name ^[27].

The other confusion comes from mixing the layer axis with the maturity axis. Documentary / in silico / physical stages ^[30] and Agentic Researcher's autonomy levels ^[1] ask "how autonomous is the system?" This chapter's four layers ask "what job is being done?" The two axes can be combined, but one does not replace the other.

2.5 Data flow between layers

In practice the four layers rarely run in isolation. LLM Wiki often becomes the long-term memory; the Research Associate reads that memory and updates code or analysis; a wiki claim can become a Paper-to-Agent task; and the resulting tool can be called inside an AI Scientist loop.

For example, a researcher first organizes a paper set in an LLM Wiki. Then Codex or Claude Code reads AGENTS.md or CLAUDE.md and writes analysis code ^[5]. One key paper becomes an MCP tool through Paper2Agent ^[27]. Later, an AI Scientist loop decides, "evaluate this hypothesis with that tool." The result and failure analysis flow back into the wiki as claims and logs. When that loop accumulates, it becomes the research OS described in Ch12.
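The flow in this example can be traced in a compact sketch. Everything here is hypothetical: the dict-based wiki, the function names, and the placeholder method are assumptions chosen to show the direction of data flow, not the CLAUDE.md, AGENTS.md, or MCP protocols themselves.

```python
from typing import Callable

# L_wiki: long-term memory holding claims and a running log.
wiki: dict[str, list[str]] = {"claims": ["method X improves metric M"], "log": []}

# L_p2a: table of callable tools derived from papers.
tools: dict[str, Callable[[float], float]] = {}


def associate_pick_claim(memory: dict[str, list[str]]) -> str:
    """L_assoc: read the wiki and select a claim to work on."""
    return memory["claims"][0]


def paper_to_tool(name: str, method: Callable[[float], float]) -> None:
    """L_p2a: expose a paper's method as a callable tool."""
    tools[name] = method


def scientist_step(claim: str, tool_name: str, x: float) -> str:
    """L_scientist: evaluate a hypothesis with the tool and report the result."""
    result = tools[tool_name](x)
    return f"{claim}: tool {tool_name} returned {result}"


claim = associate_pick_claim(wiki)
paper_to_tool("method_x", lambda x: x * 2)                   # placeholder method
wiki["log"].append(scientist_step(claim, "method_x", 2.0))   # result flows back into the wiki
```

The final line closes the circuit: the result produced by calling the tool is written back into the same memory the first step read from.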

Figure 2.2: Inter-layer data flows — five arrows with their protocol labels (CLAUDE.md, AGENTS.md, MCP, forum-scratch, claim-update) — illustration by author (gpt-image assisted)

2.6 The boundaries are not crisp

A taxonomy is a tool, not a truth. This four-layer cut does not classify every case perfectly. Still, two questions handle most of the blur.

First, is the artifact read, or is it called? A very detailed wiki page may contain pseudocode, I/O specs, dataset links, and a reproduction environment. That can look close to Paper-to-Agent. But the wiki page is still a human-readable artifact; Paper-to-Agent produces something programs call. The boundary blurs as wiki pages accumulate code snippets, but callability remains the test.

Second, is the system executing a human goal, or closing a discovery loop? Codex /goal's six-hour run ^[28] executed a human-defined task for a long time, so it is L_assoc. Karpathy's autoresearch ^[16] loops through hypothesis → mutation → evaluation → next mutation, so it moves toward L_scientist. The same coding-agent substrate can support both. The layer is determined by the loop structure, not by the tool name.
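The two boundary questions can be written down as a pair of small decision functions. This is only a restatement of the test in code, with hypothetical names; real systems will often need a judgment call about which answer applies.

```python
from enum import Enum


class Layer(Enum):
    WIKI = "L_wiki"
    P2A = "L_p2a"
    ASSOC = "L_assoc"
    SCIENTIST = "L_scientist"


def classify_artifact(called_by_programs: bool) -> Layer:
    """Question 1: is the artifact read by humans or called by programs?"""
    return Layer.P2A if called_by_programs else Layer.WIKI


def classify_run(selects_next_experiment_itself: bool) -> Layer:
    """Question 2: is the run executing a human goal or closing a discovery loop?"""
    return Layer.SCIENTIST if selects_next_experiment_itself else Layer.ASSOC


# A wiki page with pseudocode is still read, not called.
detailed_wiki_page = classify_artifact(called_by_programs=False)

# A long run on a human-defined task versus a loop that picks its own next step.
long_goal_run = classify_run(selects_next_experiment_itself=False)
mutation_loop = classify_run(selects_next_experiment_itself=True)
```

Neither function takes the tool name as input, which mirrors the point that the layer is determined by the artifact and the loop structure.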

This blur is not a failure of the taxonomy. A useful taxonomy is not perfectly clean; it is clean enough to orient the reader and honest about where the edges soften. That is the role of this four-layer map.

Figure 2.3: The taxonomy is a tool — the four layers can be decomposed and then composed back into domain-specific research workflows — illustration by author (gpt-image assisted)

2.7 Onward to Ch3 — timeline and six-level maturity

The four-layer frame is a spatial taxonomy — what is being done. Ch3 adds a temporal-developmental taxonomy — within the same layer, systems can be ranked on autonomy across six levels, L0 (humans control every step) → L6 (autonomous wet-lab). It adopts the Agentic Researcher five-level taxonomy ^[1] and extends it with L0 (the LLM-use baseline) plus the L5/L6 split (dry-lab vs. wet-lab).

Ch3 also walks the 21 months from Sakana v1 (August 2024) to May 2026 in chronological order, locating when and where each of this chapter's four layers was named. The four-layer × six-level grid is that chapter's output.

To close: the four layers are L_wiki — knowledge synthesis (Karpathy gist + 6 OSS), L_p2a — conversion into tools (Stanford Paper2Agent), L_assoc — code/analysis execution (Codex /goal + Claude Code), L_scientist — closed-loop discovery (Sakana → Co-Scientist → AAR → autoresearch). That these can be separated is one contribution of the survey; that having separated them they can be composed is the second. Ch4–Ch6 cover L_wiki only, Ch7–Ch9 cover L_p2a + L_scientist, and Ch10–Ch12 cover composition and operation of all four.

References

Agentic Researcher, "The Agentic Researcher: A Practical Guide to AI-Assisted Research," arXiv:2603.15914, 2026. [Agentic Researcher, 2026]
Agentpedia, "Karpathy's LLM Wiki: The Complete Guide to His Idea File," Agentpedia, 2026. [Agentpedia, 2026]
AIwire, "Stanford's Paper2Agent Reimagines Scientific Papers as Interactive AI Agents," HPCwire AIwire, 2025-10-10. [AIwire, 2025]
Anthropic, "Automated Alignment Researchers — Using LLMs to scale scalable oversight," Anthropic Research, 2026-04-14. [Anthropic, 2026]
Anthropic, "Claude Code memory + subagent documentation," Anthropic Docs, 2026. [Anthropic, 2026]
Clark, Jack, "Import AI 454: Automating alignment research," Import AI, 2026-04-20. [Clark, 2026]
Denser.ai, "From RAG to LLM Wiki: What Karpathy's idea means for AI knowledge bases," Denser.ai Blog, 2026. [Denser, 2026]
Ghafarollahi, Alireza et al. (2024). SciAgents: Automating scientific discovery through multi-agent intelligent graph reasoning. arXiv:2409.05556. [Ghafarollahi et al., 2024]
Gottweis, Juraj et al. (2025). Towards an AI co-scientist (Google AI Co-Scientist). arXiv:2502.18864. [Gottweis et al., 2025]
HKUDS (2025). AI-Researcher: Autonomous Scientific Innovation. arXiv:2505.18705. [HKUDS, 2025]
InfoQ, "Paper2Agent Converts Scientific Papers into Interactive AI Agents," InfoQ, 2025-10. [InfoQ, 2025]
Izacard, Gautier et al. (2022). Atlas: Few-shot Learning with Retrieval Augmented Language Models. arXiv:2208.03299. [Izacard et al., 2022]
Jumper, John et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature 596: 583-589. [Jumper et al., 2021]
Karpathy, Andrej, "LLM Wiki — A pattern for building personal knowledge bases using LLMs," GitHub Gist, 2026-04-04. [Karpathy, 2026]
Karpathy, Andrej, "LLM Wiki announcement (Twitter/X thread)," Twitter/X, 2026-04-04. [Karpathy, 2026]
Karpathy, Andrej, "karpathy/autoresearch — AI agents running research on single-GPU nanochat training," GitHub, 2026-03-07. [Karpathy, 2026]
King, Ross D. et al. (2009). The Automation of Science. Science 324: 85-89. [King et al., 2009]
Lala, J. et al. (2024). PaperQA2 — Language agents achieve superhuman synthesis of scientific knowledge. arXiv:2409.13740. [Lala et al., 2024]
Lewis, Patrick et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. [Lewis et al., 2020]
Lu, Chris et al. (2024). The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery. arXiv:2408.06292. [Lu et al., 2024]
OpenAI, "Codex /goal Command," Ralphable, 2026. [OpenAI, 2026]
Packer, Charles et al. (2023). MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560. [Packer et al., 2023]
Park, Joon Sung et al. (2023). Generative Agents: Interactive Simulacra of Human Behavior. UIST 2023. [Park et al., 2023]
Schmidgall et al. (2025). Evaluating Sakana's AI Scientist for Autonomous Research. arXiv:2502.14297. [Schmidgall et al., 2025]
Silver, David et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature 529: 484-489. [Silver et al., 2016]
skyllwt, "OmegaWiki — Wiki-centric full-lifecycle AI research platform on Claude Code (DAIR Lab, Peking University)," GitHub, 2026-04. [skyllwt, 2026]
Stanford team (2025). Paper2Agent: Reimagining Research Papers As Interactive and Reliable AI Agents. arXiv:2509.06917. [Stanford, 2025]
Tecton & Tide, "/goal: The Six-Hour Codex Run That Survived a Five-Hour Pause," Tecton & Tide Blog, 2026-04. [Tecton & Tide, 2026]
Um, Taewoong, "Democratization of research — three stages (document → in silico → physical)," terryum.ai, 2026-04-15. [Um, 2026]
Um, Taewoong, "Claude Code → Codex 이관 전략" [Claude Code → Codex migration strategy], terryum.ai, 2026-04-24. [Um, 2026]
Wang, Guanzhi et al. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models. TMLR 2024. [Wang et al., 2023]
Willison, Simon, "Codex CLI 0.128.0 adds /goal," Simon Willison's Blog, 2026-04-30. [Willison, 2026]
Yamada, Yutaro et al. (2025). The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search. arXiv:2504.08066. [Yamada et al., 2025]
Papers2Code, "Papers2Code — AI Research to Code," Papers2Code, 2026. [Papers2Code, 2026]
Zhang, Xiangyue (2026). Deep Researcher Agent: An Autonomous Framework for 24/7 Deep Learning Experimentation. arXiv:2604.05854. [Zhang, 2026]