Glossary

Key terms in A-Z order. The `(ChN)` tags after each entry mark the chapters where the term is introduced or discussed in depth.

A

AAR (Automated Alignment Researcher): Anthropic's 2026-04 system where nine Claude Opus 4.6 instances autonomously ran weak-to-strong supervision research for five days, reaching PGR 0.97 versus a human baseline of 0.23 (Ch1, Ch7, Ch8, Ch12).

Agentic Research Associate: The third layer of the four-layer taxonomy. A stage in which Claude Code or Codex acts as a research executor, running in a sandbox repository to update literature, code, analysis, and reports (Ch2, Ch10).

AGENTS.md: A Codex-compatible instruction file defining agent behavior rules; the tool-neutral counterpart to `CLAUDE.md` (Ch4, Ch10).

AI Co-Scientist: Google's multi-agent system built on Gemini 2.0. It improves its own hypotheses through a loop of generation, debate, evolution, and an Elo tournament. Wet-lab validation covers AML, liver fibrosis, and cf-PICIs (Ch1, Ch7, Ch9).
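
The Elo tournament named in the AI Co-Scientist entry can be illustrated with the textbook Elo rating update. This is a minimal sketch of standard Elo, not Google's implementation; the K-factor of 32 and the starting rating of 1200 are assumptions chosen only for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one pairwise comparison.

    k is the K-factor (an assumed value here); larger k moves ratings faster.
    """
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - exp_a)
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two hypotheses start at the same (assumed) rating; the first wins a debate.
hypothesis_a, hypothesis_b = elo_update(1200.0, 1200.0, a_won=True)
```

Repeating this update over many pairwise comparisons produces a ranking in which hypotheses that keep winning rise to the top.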

AI Scientist: The fourth and most autonomous layer of the four-layer taxonomy. Performs the full closed loop: hypothesis → experiment design → execution → analysis → paper/review (Ch2, Ch7, Ch12).

autoresearch: Karpathy's 2026-03 autonomous ML experimentation framework. An agent edits, runs, and evaluates a single `train.py` in a loop. 700 experiments in two days, 11% reduction in GPT-2 Time-to-Train (Ch1, Ch8).

B

Bounded Autonomy: An operating pattern for AI Scientist L6 where wet-lab/robot commands require explicit human approval gates. Contrasts with fully autonomous operation (Ch9, Ch12).

Brain Augmentation: Terry's essay framing (2026-03) — "research in the AI era equals building an environment for self-sustaining knowledge generation." The starting point of this survey (Ch1, Ch11).

C

claim schema: The page-level format for a research-grade LLM Wiki. Mandatory fields: Evidence / Confidence / Scope / Contradicts / Relevance / Next experiment / Owner (Ch6).
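
The seven mandatory fields of the claim schema can be written down as a small data structure. The sketch below is one possible encoding, not the survey's own format: the field types, the `Claim` class name, the `missing_fields` helper, and the rule that an empty `contradicts` list is acceptable are all assumptions.

```python
from dataclasses import dataclass, fields


@dataclass
class Claim:
    """One wiki claim; field names mirror the mandatory schema fields."""
    evidence: list[str]      # pointers to raw sources backing the claim
    confidence: str          # e.g. "low" / "medium" / "high" (assumed scale)
    scope: str               # conditions under which the claim holds
    contradicts: list[str]   # ids of conflicting claims; may be empty
    relevance: str           # why the claim matters to the research question
    next_experiment: str     # what would test or extend the claim
    owner: str               # who is responsible for keeping it current


def missing_fields(claim: Claim) -> list[str]:
    """Return the names of mandatory fields that were left empty."""
    return [
        f.name
        for f in fields(claim)
        # Assumption: "no known contradictions" is a valid, explicit answer.
        if f.name != "contradicts" and not getattr(claim, f.name)
    ]


draft = Claim(
    evidence=["raw/example-source.pdf"],  # hypothetical path
    confidence="medium",
    scope="small-model experiments only",
    contradicts=[],
    relevance="bears on the main hypothesis",
    next_experiment="",
    owner="terry",
)
print(missing_fields(draft))
```

A check like this could run before a page is committed, so that a claim with no recorded next experiment or owner never enters the wiki silently.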

CLAUDE.md: Claude Code's instruction file for agent behavior rules. Claude-side counterpart to AGENTS.md (Ch4, Ch10).

Codex CLI: OpenAI's local-terminal coding agent. Version 0.128.0 (2026-04-30) added `/goal` long-horizon execution, permission profiles, and Worktree/Cloud (Ch10).

contradiction page: A page in an LLM Wiki that explicitly records conflicts between two sources or within a single source's data. A core defense against wiki rot (Ch6).

E

Extended Mind: Clark & Chalmers' 1998 *Analysis* paper proposing the parity principle — that external stores can be part of cognitive processes. The philosophical anchor for Brain Augmentation (Ch1, Ch11).

H

honest negative result publishing: A field discipline cemented by Schmidgall et al.'s Sakana v1 critique and Anthropic AAR's Sonnet-4 transfer caveat: publish negative results as body content, not footnotes (Ch3, Ch7, Ch8, Ch12).

hook: A deterministic shell command run at specific points in the Claude Code agent lifecycle. Enforces rules such as citation checks, raw-source immutability, and test-before-run (Ch10).
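
As an illustration of the `hook` entry in this section, the sketch below shows how a raw-source immutability rule might be checked by a script that a hook command invokes. It rests on several assumptions that should be verified against the current Claude Code hooks documentation: that the hook receives a JSON event on stdin containing `tool_name` and `tool_input.file_path`, that exit code 2 blocks the tool call, and that the write tools are named `Write`, `Edit`, and `MultiEdit`. The `raw` directory name is hypothetical.

```python
import json
import sys
from pathlib import PurePosixPath

RAW_DIR = "raw"                                # hypothetical raw-source folder
WRITE_TOOLS = {"Write", "Edit", "MultiEdit"}   # assumed tool names


def check_event(event: dict) -> tuple[int, str]:
    """Return (exit_code, message) for one tool-call event.

    0 lets the call proceed; 2 is assumed to block it.
    """
    if event.get("tool_name") not in WRITE_TOOLS:
        return 0, ""
    path = event.get("tool_input", {}).get("file_path", "")
    if RAW_DIR in PurePosixPath(path).parts:
        return 2, f"Blocked: {path} is under {RAW_DIR}/, which is immutable."
    return 0, ""


def main() -> int:
    """Read one event from stdin and report the decision on stderr."""
    code, message = check_event(json.load(sys.stdin))
    if message:
        print(message, file=sys.stderr)
    return code

# A real hook script would end with: raise SystemExit(main())
```

Keeping the decision in a pure function (`check_event`) makes the rule testable without launching the agent.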

L

LLM Wiki: The first layer of the four-layer taxonomy. An external knowledge engine where an agent reads raw sources and accumulates and updates markdown wiki pages. Karpathy's 2026-04-04 gist, which reached 16M views, defined the pattern (Ch2, Ch4).

M

MCP (Model Context Protocol): A standard interface connecting LLM agents to external tools, databases, and APIs. Used to integrate research sources like PubMed and Benchling (Ch4, Ch10).

MemGPT: Packer et al.'s 2023 framing of LLM context management through an OS-virtual-memory metaphor. A direct ancestor of LLM Wiki's archival store (Ch4).

O

open question page: A page in an LLM Wiki that explicitly records what is not yet known. A primary source for hypothesis generation (Ch6).

P

Paper-to-Agent: The second layer of the four-layer taxonomy. Converts papers from passive summaries into callable MCP tools or Python modules. Stanford Paper2Agent is the canonical example (Ch2, Ch8).

PGR (Performance Gap Recovered): An evaluation metric for weak-to-strong generalization. AAR reached 0.97 versus a human baseline of 0.23, but the Sonnet-4 transfer was not statistically significant (Ch7, Ch8).
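
PGR is usually defined, following the weak-to-strong generalization literature, as the share of the gap between the weak supervisor and the strong model's ceiling that the weakly supervised strong model recovers. The sketch below computes it under that definition; the function and parameter names are illustrative, and the numbers in the usage line are invented to show the arithmetic, not AAR measurements.

```python
def performance_gap_recovered(weak: float, ceiling: float, weak_to_strong: float) -> float:
    """PGR = (weak_to_strong - weak) / (ceiling - weak).

    weak           -- performance of the weak supervisor
    ceiling        -- strong model trained with ground-truth labels
    weak_to_strong -- strong model trained on the weak supervisor's labels
    """
    gap = ceiling - weak
    if gap == 0:
        raise ValueError("ceiling and weak performance are equal; PGR is undefined")
    return (weak_to_strong - weak) / gap


# Illustrative numbers only: 0 means no gap recovered, 1 means the full gap.
example_pgr = performance_gap_recovered(weak=0.60, ceiling=0.80, weak_to_strong=0.70)
```

A PGR near 1 means the strong model almost matches its ground-truth ceiling despite weak supervision; a PGR near 0 means it performs no better than its supervisor.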

R

RAG (Retrieval-Augmented Generation): The baseline pattern that retrieves chunks from a vector database at query time and injects them into the LLM. The contrast against which LLM Wiki defines itself — LLM Wiki performs synthesis at ingest time and maintenance time (Ch4).

S

Sakana The AI Scientist: Sakana AI's end-to-end AI Scientist system. v1 (arXiv:2408.06292, 2024-08) is the tail-tracking origin of this survey; v2 (arXiv:2504.08066, 2025-04) adds agentic tree search and an ICLR workshop submission (Ch3, Ch7).

SDL (Self-Driving Lab): A wet-lab system that automates experiment-protocol generation, robot execution, and result collection. King's Adam (Science 2009) was the first case; RoboChem-Flex 2026 is an LLM-integrated successor (Ch9, Ch12).

subagent: A specialized assistant in Claude Code with its own context window, system prompt, and tool access. Performs task-specific workflows (Ch10).

W

wiki rot: The degradation of an LLM Wiki over time through incorrect summaries, stale knowledge, and lost provenance. Empirical corpus as of this survey: n=2 (aimaker n=1 + Yu critique) (Ch6).

worked example (recursion): The meta moment in which the tools Terry used to produce this survey (Obsidian × terryum.ai × Claude Code/Codex × terry-surveys monorepo) simultaneously become the worked example of Chapter 11 (Ch11).

6-Level Maturity

L0 through L6: Maturity levels for using AI research tools: an L0 baseline followed by six levels. L0 one-shot summarization → L1 Research Assistant → L2 LLM Wiki → L3 Paper-to-Agent → L4 Agentic Research Associate → L5 Dry-lab AI Scientist → L6 Wet-lab AI Scientist (Ch3, Ch12).