Part IV: External-Brain Tutorial — Build Your Own Research OS

Chapter 12: Build Your Own AI Scientist Environment — A Step-by-Step Roadmap

Written: 2026-05-22 Last updated: 2026-05-22

12.1 The Book's Last Chapter — Your First Chapter

Figure 12.1: Six-level roadmap — L0 no AI, L1 chat-assisted, L2 LLM Wiki, L3 Paper-to-Agent, L4 Research Associate, L5 AI Scientist — illustration by author (gpt-image assisted)

If (Chapter 10) was Day 1 and (Chapter 11) was the four-month worked example, this chapter is the roadmap for the next twelve months. In (Chapter 3) we proposed a maturity model that stacks six levels on top of the L0 baseline: L0 one-shot chat, L1 AI Research Assistant, L2 LLM Wiki, L3 Paper-to-Agent, L4 Agentic Research Associate, L5 dry-lab AI Scientist, L6 wet-lab AI Scientist. This chapter lays out, step by step, how much time each of L2–L6 takes, what to measure, and what to avoid.

Two core messages.

First, do not skip steps. The six levels are not merely a maturity classification; they are a chain of prerequisites. To reach L5 dry-lab AI Scientist, L4 Research Associate must be stable; for L4 to be stable, three key papers must be callable via L3 Paper-to-Agent; for L3 to work, the L2 LLM Wiki must already have those papers' claims, contradictions, and open questions organized. The four-layer taxonomy in (Chapter 2), LLM Wiki / Paper-to-Agent / Research Associate / AI Scientist, is exactly that chain.

Second, L6 is the book's horizon, not its next step (G9). As (Chapter 9)'s 17-year lineage from Adam 2009 through Boiko 2023 to RoboChem-Flex 2026 shows, wet-lab autonomy is production-grade only in very narrow domains (mostly chemistry and biology). This chapter offers the currently visible signals for reaching L6, while honestly saying L5 is the realistic ceiling for most readers.

This chapter assembles (Chapter 9)'s domain case studies, (Chapter 7-8)'s AI Scientist genealogy, and (Chapter 6)'s wiki schema. It is the most prescriptive chapter of the book.

12.2 L2 LLM Wiki — The Level You Must Reach First

What you build: on top of 30-50 papers + 20 patents + 20 internal reports / failure cases for one topic, a vault where five page families — claim · contradiction · open-question · dead-end · concept map — are updated weekly.

Why this is the first step: as (Chapter 11) showed, every upper layer runs on the substrate of an LLM Wiki. Even AAR's nine Opus instances (Chapter 8) start from some organized prior knowledge. autoresearch runs on a human's organized judgment about which metrics matter. Without L2, L4 and L5 collapse into meaningless noise generators.

Time budget (honest range): five minutes for the first ingest (Chapter 10), 2-4 weeks for a functioning L2 vault. Aimaker's four-month report ^[12] and terry's own four months of operation (Chapter 11) agree that roughly 30% of the first month's content gets rewritten during lint. The realistic milestone is not "finished in a month" but "a self-correcting cycle running within a month."

Cost (G10): API calls $30-60/month, 1-2 GB disk, 3-5 hours/week of human time (lint + review). No GPU required (Chapter 10).

Measurable exit criteria: move to L3 only when all four are "yes."

Has wiki/claims/ accumulated 50+ source-anchored claims?
Does wiki/contradictions/ hold at least 5 — that is, have you found real cross-paper conflicts?
Do at least 3 entries in wiki/open-questions/ carry an experiment idea candidate?
Does the weekly lint cycle finish in under five minutes? (When it stretches, that is a wiki-rot signal.)

These four criteria are an onboarding-tier simplification of the evaluation-metric table in (Chapter 6) §6.x.
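The four questions above can be turned into a small script that runs alongside the weekly lint. The sketch below is one possible shape, not the book's tooling: it assumes one markdown page per claim, contradiction, or open question, and it assumes a hypothetical `experiment-idea:` marker line to detect open questions that carry an experiment idea. It only counts pages; it does not verify that each claim is actually source-anchored.

```python
from pathlib import Path

# Thresholds taken from the four exit criteria above.
MIN_CLAIMS = 50
MIN_CONTRADICTIONS = 5
MIN_OPEN_QUESTIONS_WITH_IDEA = 3
MAX_LINT_SECONDS = 5 * 60

# Assumption: an open-question page "carries an experiment idea"
# if it contains this (hypothetical) marker string.
EXPERIMENT_MARKER = "experiment-idea:"


def count_pages(folder: Path) -> int:
    """Count markdown pages directly inside a folder (0 if it is missing)."""
    return len(list(folder.glob("*.md"))) if folder.is_dir() else 0


def count_pages_with_marker(folder: Path, marker: str) -> int:
    """Count markdown pages whose text contains the marker string."""
    if not folder.is_dir():
        return 0
    return sum(marker in p.read_text(encoding="utf-8") for p in folder.glob("*.md"))


def l2_exit_report(vault: Path, last_lint_seconds: float) -> dict[str, bool]:
    """Return one pass/fail flag per exit criterion."""
    wiki = vault / "wiki"
    open_questions = count_pages_with_marker(wiki / "open-questions", EXPERIMENT_MARKER)
    return {
        "claims": count_pages(wiki / "claims") >= MIN_CLAIMS,
        "contradictions": count_pages(wiki / "contradictions") >= MIN_CONTRADICTIONS,
        "open_questions": open_questions >= MIN_OPEN_QUESTIONS_WITH_IDEA,
        "lint_time": last_lint_seconds < MAX_LINT_SECONDS,
    }


def ready_for_l3(report: dict[str, bool]) -> bool:
    """Move to L3 only when all four answers are yes."""
    return all(report.values())
```

A call such as `l2_exit_report(Path.home() / "research-vault", last_lint_seconds=240)` returns the four flags, which makes it easy to see which criterion is still blocking the move to L3.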

Risks:

Risk	Mitigation
Wiki rot — accumulating bad summaries and stale knowledge	git diff review, source-ID enforcement hook (Chapter 6)
Lost provenance — claims separated from their paper evidence	claim-schema enforcement, the minimum fields from (Chapter 10) §10.3
Over-synthesis — agent inventing conclusions not in the papers	fact / inference / speculation separation, hook #4
Model lock-in — wiki tied to a specific model's voice	keep markdown + git, check model-swap reversibility (Chapter 10) §10.7
Prompt injection — raw web docs corrupting agent instructions	raw sandbox, instruction hierarchy
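Two of the mitigations above, source-ID enforcement and fact / inference / speculation separation, come down to checking a few fields on every claim page. The following is a minimal sketch under stated assumptions: claim pages start with a flat `key: value` front-matter block between `---` lines, and the field names `source_id` and `kind` are hypothetical stand-ins for whatever the claim schema in (Chapter 6) and (Chapter 10) §10.3 actually prescribes.

```python
# Assumed (hypothetical) schema: every claim page declares where it comes from
# and whether it is a fact, an inference, or a speculation.
ALLOWED_KINDS = {"fact", "inference", "speculation"}
REQUIRED_FIELDS = ("source_id", "kind")


def parse_front_matter(text: str) -> dict[str, str]:
    """Parse flat `key: value` lines between the first two `---` lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {}  # no closing delimiter: treat the page as having no front matter


def claim_problems(text: str) -> list[str]:
    """List schema violations for one claim page (empty list means it passes)."""
    fields = parse_front_matter(text)
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not fields.get(name)]
    kind = fields.get("kind")
    if kind and kind not in ALLOWED_KINDS:
        problems.append(f"unknown kind: {kind}")
    return problems
```

Run over every file in wiki/claims/ before a commit, a check like this rejects claims that have lost their provenance or that blur the line between what a paper says and what the agent inferred.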

12.3 L3 Paper-to-Agent — Make Three Key Papers Callable

What you build: from the L2 vault, pick the three most important papers and convert their algorithms, metrics, dataset loaders, and experiment protocols into Python modules or MCP tools. Apply the result to your own internal data to test "does that paper's method work on our problem?"

Why this is the next step: L2 stayed in the language layer. L3 compiles that language into executable code. We follow Stanford Paper2Agent's pattern from (Chapter 8) — the three case studies (AlphaGenome, ScanPy, TISSUE) ^[4] — but at smaller scale.

Selection criteria — which three papers?:

The code is published and actively maintained on GitHub (G7's mature-code branch)
Clearly defined metrics with reported numbers
Overlaps your domain but is not perfectly aligned (so the transfer is meaningful)

The flow for Paper2Agent conversion ^[4]:


1. /paper-search next surfaces candidates
2. Save the PDF to raw/papers/, generate a wiki page with /paper
3. Clone the GitHub repo, inspect tests/
4. Ask claude/codex: "wrap this repo as an MCP server"
5. Register in mcp.json, smoke-test from claude/codex
6. Pilot-run on your internal dataset
7. Record results in wiki/claims/ as an "internal-validation" claim
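Step 5 above can be scripted so that registering the second and third paper-agents does not mean hand-editing JSON. The sketch below assumes the `mcpServers` layout that several MCP clients read from their JSON config; your client may expect a different file name or format, so check its documentation. The server name and script path in the usage note are hypothetical.

```python
import json
from pathlib import Path


def register_server(config_path: Path, name: str, command: str, args: list[str]) -> dict:
    """Add or replace one server entry, keeping every other entry in the file."""
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    else:
        config = {}
    # Assumption: servers live under a top-level "mcpServers" object.
    servers = config.setdefault("mcpServers", {})
    servers[name] = {"command": command, "args": args}
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

For example, `register_server(Path("mcp.json"), "paper-a-agent", "python", ["servers/paper_a_server.py"])` adds one entry and leaves existing ones untouched, so the smoke test in step 5 can be repeated after each registration.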

Time budget: 1-2 weeks per paper. Three papers: 6-8 weeks. Most of the time goes into step 6 (applying the tool to internal data: preprocessing, evaluation-metric alignment, and the rest), not step 4 (wrapping the repo).

Cost: additional $40-80/month in API calls (code generation and test iteration); $1-3/hour cloud GPU if the paper requires GPU inference. The point is that the cost stacks on top of the L2 infrastructure — nothing new to install.

Lighter alternative (G7's immature-code branch): if the paper's code is not clean or is itself a prototype, full Paper-to-Agent conversion is overkill — an LLM Wiki page-of-the-paper + executable notebook is enough. (Chapter 8) develops this tradeoff explicitly — Paper-to-Agent has ROI only on the mature-code branch.

Measurable exit criteria:

Are all three key papers applicable to your internal data? (Even "n/a" or negative outcomes count — they are data.)
Are the application results recorded as your own new claims in wiki/claims/?
Do the three paper-agents in mcp.json work in a fail-safe manner? (Is the data-boundary hook still alive?)

12.4 L4 Codex/Claude Code Research Repo — Research Associate

What you build: a research repo with AGENTS.md + CLAUDE.md + TODO.md + report.md + experiment-log.md. Claude Code or Codex updates literature, code, analysis, and report every week. Humans hold only direction, approval, interpretation, and high-risk decisions.

Difference from L3: L3 was about paper → tool conversion. L4 makes research itself the output of the agent loop. L3-style agents combine into a coherent research workflow.

Structure:


research-repo/
├── AGENTS.md / CLAUDE.md  # L4 instructions
├── TODO.md                # unfinished work
├── report.md              # latest results summary
├── experiment-log.md      # every experiment recorded
├── literature/            # imported from L2 LLM Wiki
├── code/                  # analysis and experiment code
├── notebooks/             # exploration notebooks
├── data/                  # internal data (data-boundary hook applied)
├── results/               # experiment outputs
└── review/                # human review comments

This is a research-repo variant of the vault terry operates in (Chapter 11). The critical difference is the care taken with data/ and experiment-log.md — this is where internal-data leakage and missing reproducibility become real risks.
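The layout above can be created with a short, idempotent script. This is a minimal sketch: the file and folder names come from the structure shown, while the placeholder heading written into each new file is an assumption for illustration. It deliberately never overwrites existing content, so it is safe to rerun on a repo that is already in use.

```python
from pathlib import Path

# Names taken from the research-repo structure shown above.
FILES = ["AGENTS.md", "CLAUDE.md", "TODO.md", "report.md", "experiment-log.md"]
DIRS = ["literature", "code", "notebooks", "data", "results", "review"]


def scaffold(root: Path) -> list[Path]:
    """Create missing files and folders; return only the paths that were created."""
    created: list[Path] = []
    root.mkdir(parents=True, exist_ok=True)
    for name in DIRS:
        path = root / name
        if not path.exists():
            path.mkdir()
            created.append(path)
    for name in FILES:
        path = root / name
        if not path.exists():
            # Placeholder content only; the real instructions are written by hand.
            path.write_text(f"# {name}\n", encoding="utf-8")
            created.append(path)
    return created
```

Calling `scaffold(Path("research-repo"))` once creates eleven entries; calling it again returns an empty list.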

Time budget: 3-6 months. The first month is for settling hooks and instructions; the second for the first experiment cycle; thereafter, weekly-cycle stabilization.

Cost: API $100-200/month, cloud GPU $200-400/month if needed, 10-15 hours/week of human time (review + direction). Still possible without an in-house GPU cluster.

Full activation of the seven hooks: (Chapter 10) §10.8 advised starting with hooks 1-2. At L4, all seven must be alive, especially hook 4 (data boundary) and hook 6 (report sync).

G11 disambiguation of human-in-the-loop — L4 uses all three modes.

Approval-gate: human signs off before experiment code runs (hook #4)
Co-reasoning: human and agent debate on the same wiki page when evaluating new hypotheses
Last-mile correction: weekly review where the human corrects agent output

Through L3, last-mile correction dominates. From L4, approval-gate and co-reasoning take the larger share.

Measurable exit criteria:

Does at least one experiment cycle complete every week with its outcome recorded in report.md?
Can you pick five random entries from experiment-log.md and reproduce them?
Is there a case where the data-boundary hook actually blocked something? (If not, the hook may be dead.)
Is the 4-week rolling intervention rate (how often humans correct the agent) decreasing? (If not, the instructions need rewriting.)
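The last criterion needs a concrete definition before it can be tracked. One reasonable reading, used in the sketch below as an assumption, is "human corrections divided by agent actions, pooled over the most recent four weeks". The weekly counts would come from review/ or experiment-log.md; how they are logged is left open here.

```python
def rolling_rate(interventions: list[int], actions: list[int], window: int = 4) -> list[float]:
    """Pooled intervention rate for each consecutive `window`-week span."""
    if len(interventions) != len(actions):
        raise ValueError("need one intervention count and one action count per week")
    rates: list[float] = []
    for end in range(window, len(actions) + 1):
        total_actions = sum(actions[end - window:end])
        total_fixes = sum(interventions[end - window:end])
        # A span with no agent actions is reported as 0.0 rather than dividing by zero.
        rates.append(total_fixes / total_actions if total_actions else 0.0)
    return rates


def is_trending_down(rates: list[float]) -> bool:
    """True if the latest rolling value is below the earliest one."""
    return len(rates) >= 2 and rates[-1] < rates[0]
```

With weekly corrections of 8, 7, 6, 5, 4, 3 against 20 agent actions per week, the rolling values fall from 0.325 to 0.225, which is the pattern this exit criterion asks for. A flat or rising series is the signal to rewrite the instructions.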

12.5 L5 dry-lab AI Scientist — DOE, BO, Simulation

What you build: a polished closed loop where the agent generates hypotheses, designs experiments via DOE or Bayesian optimization, runs the simulation or computational experiment, analyzes results, and selects the next experiment. Humans only define objective, search space, constraints, and approvals.

Production-grade exemplars — what is currently possible:

autoresearch on nanochat ^[2]: 700 experiments / 2 days / 11% Time-to-GPT-2 reduction. The load-bearing example in (Chapter 8) §8.x.
AAR Opus 4.6 ×9 ^[1]: PGR 0.97 vs human baseline 0.23. But at Sonnet 4 scale the improvement was not statistically significant — the G3 caveat (Chapter 8). L5 closes the loop only at specific model scales.
AI Co-Scientist on AML ^[6]: drug-repurposing candidates validated in vitro. Single domain (biomedical hypothesis generation).
Deep Researcher Agent ^[21]: 500+ cycles with $0.08/24h zero-cost monitoring. The cost-engineering exemplar in (Chapter 8) §8.x.

Essential difference from L4: through L4, humans selected the next experiment. At L5, the agent proposes; humans only approve. This is the central shift in the AI Scientist genealogy of (Chapter 7-8) — from Sakana v1's end-to-end closed loop to AAR's nine-instance peer setup.
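The propose / approve / run / select cycle can be illustrated in a few lines. The sketch below is a toy under stated assumptions: a one-dimensional search space, a simulated objective supplied by the caller, and a simple explore-or-refine heuristic standing in for DOE or Bayesian optimization (it is not either of those). The `approve` callback is the human's approval gate; every name here is hypothetical.

```python
import random
from typing import Callable, Optional


def closed_loop(
    objective: Callable[[float], float],
    bounds: tuple[float, float],
    approve: Callable[[float], bool],
    n_cycles: int = 30,
    seed: int = 0,
) -> tuple[Optional[float], float, list[dict]]:
    """Agent proposes, human approves, simulation runs, best result steers the next proposal."""
    rng = random.Random(seed)  # fixed seed so the run can be reproduced from the log
    low, high = bounds
    best_x: Optional[float] = None
    best_y = float("-inf")
    log: list[dict] = []
    for cycle in range(n_cycles):
        # Propose: explore at random 30% of the time (and until something has run),
        # otherwise refine near the best point seen so far.
        if best_x is None or rng.random() < 0.3:
            x = rng.uniform(low, high)
        else:
            step = 0.1 * (high - low)
            x = min(high, max(low, best_x + rng.uniform(-step, step)))
        approved = approve(x)
        entry = {"cycle": cycle, "x": x, "approved": approved, "y": None}
        if approved:
            y = objective(x)  # the simulation or computational experiment
            entry["y"] = y
            if y > best_y:
                best_x, best_y = x, y
        log.append(entry)  # rejected proposals are logged too
    return best_x, best_y, log
```

For instance, `closed_loop(lambda x: -(x - 2) ** 2, (0.0, 5.0), approve=lambda x: x <= 3.0)` searches for the maximum while the approval gate rejects any proposal above 3.0. The returned log plays the role of experiment-log.md: each cycle records what was proposed, whether it was approved, and what came back.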

Time budget: 12-18 months from L4. The first six months are DOE/BO infrastructure; the next six are the first closed-loop attempts; thereafter, stabilization on one domain. Most solo researchers stop before this level, and that is not a problem: L5 is the production-R&D ceiling, while the solo-researcher ceiling is usually L4.

Cost (G10): AAR-grade experiments typically run $5k-20k/run ^[1]. Following Deep Researcher Agent's zero-cost monitoring pattern ^[21] can compress this to $50-200/run. The key point is that engineering-flavored production cases such as Tobi Lütke's Shopify 53% speedup let outcome value justify the cost ^[13].

G12, research vs engineering split: the two branches developed in (Chapter 8). The engineering branch (production-code optimization) is relatively mature at L5; the research branch (new hypothesis generation) is still contested. Sakana's ICLR workshop submission is the frontier of that contested branch; Schmidgall et al.'s novelty-assessment critique ^[7] is its limit.

Risks:

Risk	Mitigation
Fake novelty — passing known ideas off as new	literature agent + reviewer agent + patent search (G14 negative-result loop)
Bad experiment design — statistically meaningless DOE	statistician subagent + human review
Lack of reproducibility — missing code/data/env records	container, seed, log enforcement hooks
Reward hacking — agent gaming the metric	to avoid repeating the AAR Sonnet 4 episode ^[1], run a separate hold-out metric
Internal-data leakage — agent sending sensitive data externally	local-model option, data-boundary hook, redaction

Measurable exit criteria:

Has a single hypothesis cycle (hypothesis → experiment → analysis → next hypothesis) closed without human intervention at least once?
Is the result organized into a form an external reviewer can examine — i.e., publishable to papers/ or surveys/?
Is PGR or a domain-appropriate metric statistically significantly better than the random baseline?
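For the last criterion, "statistically significantly better than the random baseline" needs a test that makes few assumptions about how the metric is distributed. A permutation test is one option; it is offered here as an illustration, not as the method used by any of the systems cited above. The sketch takes per-run metric values for the agent and for the random baseline and returns a one-sided p-value.

```python
import random
from statistics import mean


def permutation_p_value(
    agent: list[float],
    baseline: list[float],
    n_permutations: int = 10_000,
    seed: int = 0,
) -> float:
    """One-sided p-value for the hypothesis mean(agent) > mean(baseline)."""
    rng = random.Random(seed)
    observed = mean(agent) - mean(baseline)
    pooled = list(agent) + list(baseline)
    n_agent = len(agent)
    hits = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable,
        # so shuffle the pooled values and re-split them.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_agent]) - mean(pooled[n_agent:])
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_permutations + 1)
```

A small p-value (by convention below 0.05) supports a "yes" on this criterion. With only a handful of runs per group the test has little power, so it is worth fixing the number of runs before looking at the results.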

12.6 L6 Wet-Lab — Bounded Autonomy

What you build: the agent generates physical experiment protocols; the robot/lab automation runs them only after approval. QC data and sensor data flow back to the wiki automatically.

Why this is the horizon (G9): the six wet-lab L6 primary sources in (Chapter 9) (three RoboChem-Flex pieces, the Brazil Nature feature, the Guan replication, and Boiko 2023) are the entire body of evidence, and they all live in the single domain of chemistry/biology. Other domains (materials, medical, robotics) have almost no L6 cases yet. The 17-year lineage that starts with Adam in 2009 (King et al.) shows that L6 is a long-time-scale program.

Currently production-grade narrow areas:

Organic synthesis automation: RoboChem-Flex ^[9] shows a ~$5k modular self-driving lab is feasible. Six case studies (photocatalysis, biocatalysis, asymmetric catalysis, and more).
Drug-repurposing validation: AI Co-Scientist's AML follow-up ^[19], liver-fibrosis replication.

For most readers: L6 requires separate infrastructure investment. Without a chemistry/biology lab, it is not accessible. Framing it as the currently available next step would be dishonest. (Chapter 9) and this chapter state this explicitly: L6 coverage is landscape with strong pointers, not case-study density.

What you can still do: there are tasks where physical surrogates are sufficient. Example — in analytical chemistry, do LC-MS data analysis at L5 while humans handle synthesis; in medicine, do chart review at L5 while humans handle patient contact. The medical AI Scientist ^[8] in (Chapter 9) follows this pattern — clinical judgment is the human's, documentation is the agent's.

Risks (domain-critical):

Risk	Mitigation
Equipment safety — dangerous commands executed	Robot safety hook (#5), interlocks, SOP checks, in-person human approval
Regulation — chemicals, biological samples, disposal	separate safety system, environmental impact review
Data bias: AI-driven labs reinforcing existing chemical-space bias ^[22]	weekly sampling-diversity audit
Reproducibility — robot calibration drift	daily calibration log, baseline reagent controls

12.7 Evaluation Metrics — Common Across All Levels

Figure 12.2: Nine-metric evaluation framework — coverage, recency, reproducibility, throughput, novelty, success, cost, wiki rot, human burden — illustration by author (gpt-image assisted)

The table below unifies the evaluation-metric tables from (Chapter 6) §6.x and ChatGPT seed §11 across L2-L6.

Metric	Definition	L2 (LLM Wiki)	L3 (P2A)	L4 (RA)	L5 (dry-lab)	L6 (wet-lab)
Literature coverage	Fraction of key sources reflected in the wiki	≥ 70%	≥ 80%	≥ 90%	≥ 90%	≥ 90%
Claim provenance	Fraction of claims carrying a source ID	≥ 95%	≥ 95%	≥ 98%	≥ 99%	≥ 99%
Contradiction discovery	Contradictions found by the agent per month	≥ 5	≥ 5	≥ 10	≥ 10	≥ 5
Hypothesis quality	Expert score (novelty + feasibility + evidence)	n/a	n/a	≥ 3/5	≥ 4/5	≥ 4/5
Experiment cycle time	Average from hypothesis to analysis	n/a	days	days	hours	days-weeks
Reproducibility	Result reproducible in the same repo	n/a	high	required	required	required
Human intervention rate	Human interventions per week	high (lint review)	medium	medium → low	low	medium (safety)
Negative result capture	Fraction of failed experiments accumulated in the wiki	≥ 50%	≥ 70%	≥ 80%	≥ 90%	≥ 95%
Safety violations	Unauthorized risky actions attempted	n/a	n/a	0	0	0 (critical)

The last metric must not rise — it must fall at every level. The norm is not 0; the norm is converging to 0 fast. (Chapter 9)'s RoboChem-Flex case ^[9] shows exactly that trajectory.
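The fraction-based rows of the table above are easy to check mechanically. The sketch below copies three of those rows (literature coverage, claim provenance, negative result capture) into a lookup and reports which ones fall short at a given level. The metric keys and function names are hypothetical, and how each fraction is measured is left to the vault's own tooling.

```python
# Minimum fractions per level, copied from three rows of the table above.
THRESHOLDS = {
    "literature_coverage": {"L2": 0.70, "L3": 0.80, "L4": 0.90, "L5": 0.90, "L6": 0.90},
    "claim_provenance": {"L2": 0.95, "L3": 0.95, "L4": 0.98, "L5": 0.99, "L6": 0.99},
    "negative_result_capture": {"L2": 0.50, "L3": 0.70, "L4": 0.80, "L5": 0.90, "L6": 0.95},
}


def fraction(part: int, whole: int) -> float:
    """Safe ratio: an empty denominator counts as 0.0 rather than raising."""
    return part / whole if whole else 0.0


def failing_metrics(level: str, measured: dict[str, float]) -> list[str]:
    """Names of metrics that are missing or below the threshold for this level."""
    return [
        name
        for name, per_level in THRESHOLDS.items()
        if measured.get(name, 0.0) < per_level[level]
    ]
```

As a worked example, a vault with 36 of 50 key sources covered (0.72), 96 of 100 claims carrying a source ID (0.96), and 6 of 10 failed experiments recorded (0.60) passes all three rows at L2 but fails literature coverage and negative result capture at L3.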

12.8 G3 Once More — Honest Caveat Publishing

In (Chapter 8) we treated AAR's Sonnet-4 transfer failure as a load-bearing paragraph. The final prescriptive message of this chapter is the same ethic.

When PGR 0.97 happens, publish where it does not work alongside it.

What the democratization of research ^[14] truly means is not the publication of research results; it is the publication of research failures. The wiki/dead-ends/ directory in (Chapter 6)'s prescriptive schema and terry's honest 30%-rewrite-rate report in (Chapter 11) are different expressions of the same ethic. AAR's negative-result publishing ^[1] and Schmidgall et al.'s critique of Sakana v1's novelty assessment ^[7] are the two primary sources of that ethic in this book.

The democratization of research is not just the lowering of the entry cost of research. It is also the raising of research honesty. This is the last condition for (Chapter 1)'s paradigm shift to be a real paradigm shift.

12.9 What You Do Next — A Seven-Item Checklist

Figure 12.3: Twelve-week staged roadmap — W1-3 LLM Wiki, W4-6 Paper-to-Agent, W7-9 Research Associate, W10-12 closed loop — illustration by author (gpt-image assisted)

Now that you have read this book, you can complete the following seven items on a scale of 24 hours to one month.

[ ] 24 hours: create the ~/research-vault/ skeleton + git init (Chapter 10)
[ ] 24 hours: write AGENTS.md / CLAUDE.md — copy the seven rules from §10.4
[ ] 48 hours: one arXiv PDF → /paper or ingest → 3-5 pages in wiki/claims/
[ ] 1 week: write your first contradiction page (find a real conflict between two papers)
[ ] 1 week: record your first dead end (a hypothesis you tried and abandoned)
[ ] 2 weeks: establish the weekly lint review — 30 minutes, same time every week
[ ] 1 month: pass all four §12.2 exit criteria — 50+ claims, 5+ contradictions, 3+ open questions with experiment ideas, lint under five minutes

Those seven cover every step from L0 to L2. The L3-L6 that follow are deeper entry into a single topic, and the tools, cases, and patterns of (Chapter 8-9-10) are the guidebook.

One last line.

The real asset of research democratization is not the GPU or the self-hosted LLM. It is the connected markdown accumulating outside your skull, and the 30 minutes a week you spend linting it. Every chapter of this book is the justification for that, and terryum.ai together with this survey is its worked example.

References

1. Anthropic (2026). Automated Alignment Researchers: Using LLMs to scale scalable oversight. Anthropic Research, 2026-04-14. #28
2. Karpathy, A. (2026). karpathy/autoresearch: AI agents running research on single-GPU training loops. GitHub, 2026. #30
3. Karpathy, A., Y. He, X. Lee, et al. (2026). LLM Wiki: A pattern for building personal knowledge bases using LLM agents. GitHub Gist, 2026-04-04.
4. Stanford Paper2Agent Team (2025). Paper2Agent: Reimagining Research Papers As Interactive and Reliable AI Agents. arXiv:2509.06917.
5. Lu, C., Lu, C., et al. (2024). The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery. arXiv:2408.06292.
6. Gottweis, J., et al. (2025). Towards an AI co-scientist (Google AI Co-Scientist). arXiv:2502.18864. #11
7. Schmidgall, S., et al. (2025). Evaluating Sakana's AI Scientist for Autonomous Research. arXiv:2502.14297.
8. Wu, H., Zheng, B., et al. (2026). Towards a Medical AI Scientist. arXiv:2603.28589. #21
9. Pilon, S., et al. (2026). A flexible and affordable self-driving laboratory for automated reaction optimization. Nature Synthesis, 2026. #31
10. Brazil, R. (2026). Inside the self-driving lab revolution. Nature Feature, 2026. #31
11. Clark, J. (2026). Import AI 454: Automating alignment research. Import AI Substack, 2026-04-20.
12. Aimaker (2026). AI-powered second brain from LLM Wiki: 4-month report. Aimaker Substack, 2026.
13. The New Stack (2026). Karpathy's AutoResearch Ran 700 ML Experiments in 2 Days Without Human Input. Reported by Um, T., terryum.ai, 2026.
14. Um, T. (terryum) (2026). Democratization of Research: three stages. terryum.ai post #25, 2026-04-15.
15. Um, T. (terryum) (2026). Brain Augmentation: manifesto for AI-era self-generating knowledge environments. terryum.ai post #7, 2026-03-10.
16. Um, T. (terryum) (2026). AAR summary and analysis. terryum.ai paper post, 2026. #28
17. Fulkerson, A. (2026). Karpathy's Pattern for an LLM Wiki in Production. Personal Blog, 2026.
18. Data Science Dojo (2026). The LLM Wiki Pattern by Andrej Karpathy: 5-paper, 30-minute tutorial. Data Science Dojo Blog, 2026.
19. Adam, D. (2026). The AI co-scientist is here. Nature Medicine Feature, 2026-03-16. #11
20. Guan, Y., et al. (2026). Independent wet-lab replication of liver fibrosis target validation. Reported on terryum.ai paper post, 2026. #31
21. Zhang, S., et al. (2026). Deep Researcher Agent: Think/Execute/Monitor/Reflect with zero-cost monitoring. Reported via terryum.ai, 2026.
22. Restrepo, G. (2026). Expanding diversity in chemical space. Nature Chemistry, 2026-03-19.