Part III: From Paper-to-Agent to AI Scientist — The Evolution of Autonomous Discovery

Chapter 9: Domain Case Studies — ML, Alignment, Biomedical, Materials, Medical

Written: 2026-05-22; last updated: 2026-05-22

9.1 What this chapter is — a domain landscape, not a domain case study

This chapter is short on purpose. It deliberately does not go deep. As positioning.md §7 states in the "won't claim" list, this survey is not a wet-lab AI Scientist guide. The critical-analyst's recommendation runs in the same direction — Ch9 should sit at preview-tier landscape + strong pointers, and the deep coverage of each domain should be left to a follow-up survey one to two years out. The reason is G9 (the gap stated in Chapter 3). The wet-lab tier (L6) rests on roughly 5–6 primary sources within this survey's corpus, and trying to go deep on that thin corpus would unbalance the framing.

The domain landscape still matters. Without seeing how far the architectural patterns of Chapters 7–8 actually reach into each domain, the 4-layer × 6-level frame (Chapters 2, 3) stays abstract. This chapter covers five domains (ML, alignment, biomedical, materials, medical) and a Self-Driving Labs (SDL) bundle. ML and alignment were covered architecturally in Chapter 8, so here they sit as domain anchors in short form. Biomedical, materials, medical, and SDL get one or two paragraphs each as landscape, because they are new at this point in the survey. §9.8 extracts cross-domain patterns and §9.9 closes.

Per the critical-analyst's recommendation, the chapter stays short. Domain deep-dives are explicitly out of this survey's scope.

9.2 ML — Karpathy autoresearch as the ML domain anchor

Figure 9.1: Cross-domain AI Scientist matrix — ML, Alignment, Biomedical, Materials, Medical with autonomy, eval, human-in-loop, flagship — illustration by author (gpt-image assisted)

The most prominent ML-domain AI Scientist case is Karpathy's autoresearch covered in Chapter 8 §8.3 ^[13]. Chapter 8 covered the architectural commitments (630-line, single-loop, 5-minute experiment budget). The domain meaning — what an ML researcher can take from autoresearch — is the focus here.

Three things. First, for a well-defined small substrate, ML research is already amenable to autoresearch. nanochat is that substrate; val_bpb and the 5-minute budget are what make it fair-iterable. Second, the fact that many of the agent-found keep-worthy improvements were prior-known patches shows that autoresearch's value lies more in application than in discovery — it locates and applies patches already known in the field but not yet in the substrate. Third, the Shopify Liquid 53% speedup ^[23] demonstrated the same pattern transfers into production engineering — evidence that the engineering-flavored application is mature (G12; Chapter 8 §8.7).
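The keep/discard pattern described above can be sketched in a few lines. This is a minimal illustration, not Karpathy's code: `propose_patch`, `run_experiment`, the patch names, and the toy effect sizes are all hypothetical stand-ins. The only details taken from the text are the metric (val_bpb, where lower is better) and the idea that each experiment runs under a fixed budget.

```python
from typing import Callable, List, Tuple


def keep_discard_loop(
    baseline_bpb: float,
    propose_patch: Callable[[int], str],
    run_experiment: Callable[[List[str], str], float],
    n_rounds: int,
) -> Tuple[List[str], float]:
    """Greedy search: keep a patch only if it lowers val_bpb."""
    kept: List[str] = []
    best = baseline_bpb
    for i in range(n_rounds):
        patch = propose_patch(i)
        # One call stands for one fixed-budget run (5 minutes in the text).
        score = run_experiment(kept, patch)
        if score < best:  # lower bits-per-byte is better
            kept.append(patch)
            best = score
        # Otherwise the patch is discarded and the substrate is unchanged.
    return kept, best


# Hypothetical toy substrate: each patch shifts val_bpb by a fixed amount.
EFFECT = {"patch-a": -0.02, "patch-b": 0.01, "patch-c": -0.005}
PATCHES = ["patch-a", "patch-b", "patch-c"]


def toy_propose(i: int) -> str:
    return PATCHES[i % len(PATCHES)]


def toy_run(kept: List[str], patch: str) -> float:
    applied = set(kept) | {patch}
    return 1.0 + sum(EFFECT[p] for p in applied)


kept, best = keep_discard_loop(1.0, toy_propose, toy_run, n_rounds=3)
print(kept, round(best, 3))
```

In the toy run, the harmful patch is tried once and dropped while the two helpful patches accumulate. That is the sense in which a fixed budget plus a single scalar metric makes the substrate fair-iterable.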

The natural next step in the ML domain is substrate diversification. Does the pattern that works for nanochat work on larger training substrates? Across more diverse task families (vision, RL, multimodal)? Those are open questions within this survey's horizon (2026-05).

9.3 Alignment — Anthropic AAR as the alignment domain anchor

The alignment-domain AI Scientist case is the AAR covered in Chapter 8 §8.4 ^[2]. Chapter 8 covered the 9-instance design, PGR 0.97, $18k cost, and the G3 paragraph-level caveat (production Sonnet-4 transfer failure). The domain meaning is summarized here.

Two things AAR shows in the alignment domain. First, the scalable-oversight program ^[4] has landed inside a measurable system. The 2022–2023 research program produced the theoretical frame and the metric (PGR ^[7]); AAR ran a 9-instance multi-agent system on top of that frame. The domain has shifted from "is scalable oversight feasible?" to "how do we systemize it?". Second, honest negative result publishing has become part of the field's discipline. The fact that the production-Sonnet-4 transfer failure and reward-hacking report appear in the release body itself shows that the honest-limits posture is a community-acceptable outcome in the alignment domain. This forms a methodological pair with Schmidgall et al.'s Sakana v1 critique ^[20] (Chapter 7 §7.4).
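For readers new to the metric, PGR (performance gap recovered) as defined by Burns et al. ^[7] is the fraction of the gap between a weak supervisor and a strong-model ceiling that weak-to-strong training closes. The sketch below computes it; the three input numbers are illustrative only and are not AAR's reported measurements.

```python
def performance_gap_recovered(
    weak: float, weak_to_strong: float, strong_ceiling: float
) -> float:
    """PGR = (weak_to_strong - weak) / (strong_ceiling - weak)."""
    gap = strong_ceiling - weak
    if gap <= 0:
        raise ValueError("strong ceiling must exceed weak performance")
    return (weak_to_strong - weak) / gap


# Illustrative numbers only: weak supervisor at 0.60, strong ceiling at 0.80,
# weak-to-strong student at 0.794.
pgr = performance_gap_recovered(weak=0.60, weak_to_strong=0.794, strong_ceiling=0.80)
print(round(pgr, 2))
```

A PGR of 1.0 means the student fully matches the ceiling, and 0.0 means it does no better than its weak supervisor. Because the value is a ratio over a model-specific gap, a PGR measured on one model pair need not carry over to another.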

The next step in the alignment domain is also clear: finding methods by which PGR 0.97 transfers to production models. This is an open question within this survey's horizon, and per the field's honest framing the failure reflects current-model brittleness, not a fundamental limit — a model-specific phenomenon ^[8].

9.4 Biomedical — wet-lab validation of Google AI Co-Scientist

Figure 9.3: Google AI Co-Scientist biomedical follow-up — AML candidate, liver-fibrosis organoid, early in vivo — illustration by author (gpt-image assisted)

The most prominent biomedical-domain case is Google's AI Co-Scientist covered in Chapter 7 §7.5 ^[10]. Chapter 7 covered the architecture (multi-agent, Elo tournament, five Reflection roles). This chapter looks at the wet-lab validation.

Three cases. First, AML drug repurposing. For candidates Co-Scientist proposed, Binimetinib reached IC50 7 nM in MOLM-13, KIRA6 reached IC50 13 nM in KG-1 (517 nM in MOLM-13) ^[10]. In vitro validation, but quantitatively reported. Second, liver fibrosis — epigenetic targets. Co-Scientist proposed three novel epigenetic targets; all four drugs reduced fibroblast activation (p < 0.001 to p < 0.0001), two without cytotoxicity, and one an FDA-approved compound ^[10]. A year later, Guan et al. reported an independent wet-lab replication — two of the epigenetic targets confirmed in organoid + animal models with anti-fibrotic activity, and one FDA-approved drug as a repurposing candidate ^[11]. This is evidence that Co-Scientist hypotheses survive third-party replication. Third, antimicrobial resistance — cf-PICIs mechanism. Co-Scientist re-derived the chromosomal-island phage-tail mechanism in silico in two days ^[18], against roughly a decade of human work (the popular-press "ten years in two days" framing, which strictly applies to in silico re-derivation).

The Nature Medicine feature ^[1] chronicles how these results have moved into organoid, animal, and early clinical follow-ups. Nature News ^[15] frames the same arc as "first peer-reviewed paper" coverage. The field's honest framing is that in vitro success is not clinical success: pharmacokinetics, drug interactions, and trial design are out of scope ^[10]. Co-Scientist's biomedical performance is a strong case for the hypothesis-generation layer; clinical use travels through a separate validation pipeline.
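Since the hypothesis-generation layer rests on Elo-ranked hypotheses, a small sketch of standard Elo updating may help. It uses the textbook Elo formulas; the K-factor of 32, the initial rating of 1200, the hypothesis labels, and the `(winner, loser)` match format are assumptions for illustration. How Co-Scientist actually decides match outcomes or sets these parameters is not shown here.

```python
from typing import Dict, Iterable, List, Tuple


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(ratings: Dict[str, float], winner: str, loser: str, k: float) -> None:
    """Move rating points from loser to winner after one pairwise match."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


def rank_hypotheses(
    names: Iterable[str],
    matches: Iterable[Tuple[str, str]],
    initial: float = 1200.0,  # assumed starting rating
    k: float = 32.0,          # assumed K-factor
) -> List[Tuple[str, float]]:
    ratings = {name: initial for name in names}
    for winner, loser in matches:
        update_elo(ratings, winner, loser, k)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical tournament: each tuple is (winner, loser) of one comparison.
ranking = rank_hypotheses(
    ["h1", "h2", "h3"],
    [("h1", "h2"), ("h1", "h3"), ("h2", "h3")],
)
for name, rating in ranking:
    print(name, round(rating, 1))
```

The ranking is relative: ratings only say which hypotheses won more comparisons against which opponents, which is why wet-lab follow-up remains a separate step.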

9.5 Materials — MIT LAMM's SciAgents (preview, no deep dive)

The most prominent materials-domain case is Ghafarollahi & Buehler's SciAgents ^[9], from the MIT LAMM (Laboratory for Atomistic and Molecular Mechanics) group, published in Advanced Materials in 2025. It rests on three pillars, applied to bioinspired materials discovery: a large ontological knowledge graph, an ensemble of LLMs and tools for retrieval, and a multi-agent system with in-situ learning.

Two contrasts with Co-Scientist matter. First, the knowledge graph is the primary data structure. Co-Scientist centers on Elo-ranked hypotheses; SciAgents centers on an ontological KG over which multi-agent reasoning operates. The KG × LLM lineage traces back to Sun et al.'s Think-on-Graph ^[22] (the KG track named in Chapter 7 §7.2). Second, interdisciplinary bridges are an explicit contribution — e.g., surfacing connections like bone microstructure → composite materials that would otherwise be treated as unrelated.
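To make "the knowledge graph is the primary data structure" concrete, the sketch below finds a chain of concepts linking two nodes in a toy graph. Everything in it is assumed for illustration: the node names, the edges, and the use of breadth-first shortest path, which is chosen for simplicity and is not claimed to be the path-sampling strategy SciAgents uses.

```python
from collections import deque
from typing import Dict, List, Optional, Set, Tuple


def build_graph(edges: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from concept pairs."""
    graph: Dict[str, Set[str]] = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path(
    graph: Dict[str, Set[str]], start: str, goal: str
) -> Optional[List[str]]:
    """Breadth-first search; returns the concept chain or None."""
    if start not in graph or goal not in graph:
        return None
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in sorted(graph[node]):  # sorted for deterministic output
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Hypothetical toy ontology; not taken from the SciAgents knowledge graph.
toy_kg = build_graph([
    ("bone microstructure", "hierarchical porosity"),
    ("hierarchical porosity", "crack deflection"),
    ("crack deflection", "composite materials"),
    ("bone microstructure", "collagen"),
    ("collagen", "silk"),
])

bridge = shortest_path(toy_kg, "bone microstructure", "composite materials")
print(" -> ".join(bridge))
```

The intermediate nodes on the returned path are what a downstream agent would be asked to reason over when drafting a hypothesis that bridges the two end concepts.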

This survey places SciAgents as an anchor case for the KG-track AI Scientist, not as a materials-domain deep dive. The next steps for the domain — wet-lab validation, transfer to other materials families — are out of scope. Code is open at lamm-mit/SciAgentsDiscovery.

9.6 Medical — Wu et al. Medical AI Scientist + Adam Nature Medicine pointer

The most prominent medical-domain case is Wu et al.'s Medical AI Scientist ^[26], from the CUHK-AIM Group with Stanford and Microsoft Research, published in March 2026. The core novelty against generic AI Scientists is a clinician-engineer co-reasoning mechanism that turns surveyed literature into actionable evidence and improves the traceability of generated ideas. It operates in three research modes (paper-based reproduction, literature-inspired innovation, task-driven exploration), each corresponding to a different autonomy level. Evaluation covers 171 cases spanning 19 clinical tasks and 6 data modalities, with double-blind expert review plus Stanford's Agentic Reviewer; manuscript quality reaches MICCAI level (consistently above ISBI and BIBM thresholds).

This survey's stance is plain. MICCAI-level alignment is measured by reviewer evaluation, not actual venue acceptance. This is the same framing as Sakana v2's ICLR workshop clearance (Chapter 7 §7.4) — the headline "AI cleared the venue" precisely means "reviewer scores measured against the venue's threshold." That does not diminish Wu et al.'s result. It is the right way to read which metric the field is using.

Adam's Nature Medicine feature ^[1] chronicles the shift in medical AI from chat-style biomedical LLMs to hypothesis-generating co-scientists. Next steps — clinical use, regulatory validation, accountability — belong to medical AI's own processes. This survey treats medical AI Scientist as a landscape pointer and leaves anything beyond that to a separate medical-AI survey.

9.7 Self-Driving Labs (SDL) — preview only

Figure 9.2: Self-driving lab — robotic chemistry workflow from King Adam 2009 to Pilon RoboChem-Flex 2026 — illustration by author (gpt-image assisted)

This section is the most explicitly preview-tier. As G9 named, the wet-lab corpus density inside this survey is insufficient. So this section only sketches SDL's 17-year lineage, mainly to establish that the lineage exists.

The lineage. King et al. 2009, Adam ^[14], is the starting point: an autonomous wet-lab robot from the Aberystwyth group that formulates hypotheses about yeast gene function and designs, executes, and analyzes the experiments to test them. 12 of 20 novel hypotheses were manually verified true. The system ran continuous wet-lab cycles for roughly a year. Adam is pre-LLM (symbolic reasoning plus a hand-curated metabolic-pathway knowledge base) and is this survey's pre-LLM L6 ancestor. The reason to cite Adam is plain: wet-lab autonomy is a 17-year program starting in 2009, and the 2026 LLM-driven systems are the latest cross-section of that program, not a sudden discovery.

How the lineage continued. Boiko et al. 2023, Coscientist ^[3], is the first LLM-driven wet-lab AI Scientist after Adam, coupling GPT-4 with an Opentrons OT-2 liquid handler to autonomously carry out Suzuki and Sonogashira cross-coupling reactions (also discussed in Chapter 7 §7.3 as a pre-v1 architectural ancestor). In the same year, on the dry-lab side, EPFL's ChemCrow ^[5] combined 18 chemistry tools for synthesis planning and chromophore design. The two papers form a dry-lab/wet-lab pair whose preprints reached arXiv in the same month, April 2023 (arXiv:2304.05332 and arXiv:2304.05376).

And 2026. Pilon et al. — RoboChem-Flex ^[17] — the Noël Research Group (University of Amsterdam, HIMS) modular self-driving lab at roughly $5,000. Python-based control, Bayesian optimization (with multi-objective + transfer-learning variants), and both autonomous closed-loop and human-in-the-loop modes. Six case studies (photocatalysis, biocatalysis, thermal cross-couplings, enantioselective catalysis, etc.). The differentiator is democratization — giving resource-limited groups access to autonomous experimentation ^[12]. Brazil's Nature feature ^[6] frames the SDL landscape as "AI sets strategy, robots execute, results feed the next experiment." Its headline message: "chemistry yes, materials informatics narrow, biology still hype."
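The difference between the autonomous closed-loop mode and the human-in-the-loop mode can be sketched as one loop with an optional approval gate. This is not RoboChem-Flex's control code. The condition tuple, the toy yield function, and the approval rule are hypothetical, and the proposer is a plain random search standing in for Bayesian optimization so that the example stays dependency-free.

```python
import random
from typing import Callable, List, Optional, Tuple

Condition = Tuple[float, float]  # (temperature_C, residence_time_min), hypothetical
History = List[Tuple[Condition, float]]


def optimize(
    propose: Callable[[History], Condition],
    run_reaction: Callable[[Condition], float],
    n_rounds: int,
    approve: Optional[Callable[[Condition], bool]] = None,
) -> History:
    """Closed loop: propose, (optionally) ask a human, run, record."""
    history: History = []
    for _ in range(n_rounds):
        candidate = propose(history)
        if approve is not None and not approve(candidate):
            continue  # rejected by the human gate; nothing is run
        history.append((candidate, run_reaction(candidate)))
    return history


def make_random_proposer(seed: int) -> Callable[[History], Condition]:
    """Random-search stand-in; a BO proposer would use the history."""
    rng = random.Random(seed)

    def propose(history: History) -> Condition:
        return (rng.uniform(20.0, 120.0), rng.uniform(1.0, 30.0))

    return propose


def toy_yield(condition: Condition) -> float:
    """Hypothetical response surface peaking at 60 C and 10 min."""
    temp, minutes = condition
    return max(0.0, 100.0 - 0.02 * (temp - 60.0) ** 2 - 0.5 * (minutes - 10.0) ** 2)


autonomous = optimize(make_random_proposer(0), toy_yield, n_rounds=10)
gated = optimize(
    make_random_proposer(0), toy_yield, n_rounds=10,
    approve=lambda c: c[0] <= 80.0,  # hypothetical safety rule
)
print(len(autonomous), len(gated))
```

Swapping `make_random_proposer` for a model-based proposer changes how candidates are chosen but leaves the loop and the gate unchanged, which is the property the two operating modes share.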

QPillars's counter-take ^[19] belongs alongside it: it grades readiness domain by domain and judges hype to be the dominant note in biology. The field's honest framing is that chemistry works, materials informatics works in narrow cases, and biology is still immature.

This survey covers SDL briefly for two reasons. First, the primary corpus on SDL is Pilon 2026, Brazil 2026 Nature, Guan 2026 replication, QPillars 2026 counter-take, Boiko 2023, King 2009: roughly 5–6 sources. Deeper coverage on that thin corpus would produce framing disproportionate to the actual evidence density. Second, no robotics-domain AI Scientist primary source is in this corpus; robotics is a separate field, and this survey defers to a separate survey (the tactile-sensing track). This survey's stance is that, as wet-lab AI Scientist case studies accumulate over the next 1–2 years, a separate follow-up survey is the proper venue.

9.8 Cross-domain — the same pattern, different weights

Across the five domains and the SDL bundle, the common pattern extracted in Chapter 7 §7.8 (literature → gap → hypothesis → design → execute → reflect) operates with different weights in different domains.

Domain	Dominant pattern slot	Dominant human-in-the-loop mode
ML (autoresearch)	execute (5-minute experiment + keep/discard)	last-mile correction
Alignment (AAR)	forum-shared reflection	approval-gate + last-mile correction
Biomedical (Co-Scientist)	hypothesis + Elo ranking	co-reasoning + wet-lab follow-up
Materials (SciAgents)	KG-grounded hypothesis	co-reasoning
Medical (Wu et al.)	clinician-engineer co-reasoning	co-reasoning
SDL (RoboChem-Flex)	execute (closed-loop + BO)	approval-gate

Which slot of the shared pattern dominates in a domain is one measure of that domain's maturity. ML and SDL weight the execute slot, because the experiments themselves are fast-iterable. Biomedical, materials, and medical weight hypothesis and co-reasoning: experiments are expensive, so hypothesis quality decides the outcome. Alignment weights reflection, because sharing across AARs is what restores PGR.

This loops back to the 6-level maturity model in Chapter 3. L5 (dry-lab AI Scientist) holds ML, alignment, and dry-lab biomedical/materials hypothesis generation. L6 (wet-lab / robot AI Scientist) holds SDL and biomedical wet-lab follow-ups. Within this survey's horizon (2026-05), L5 is dense and L6 is thin — the corpus-density gap G9 named.

9.9 Closing — what the domain landscape shows

This chapter covered five domains and the SDL bundle at preview-tier. The deliberate depth shortfall is the point. This survey's scope is the cleanest 4-layer × 6-level map (positioning.md §1), not a per-domain deep dive. Per-domain depth belongs to other surveys — biomedical to medical-AI surveys, materials to materials-informatics surveys, robotics to robotics surveys.

Still, what this chapter contributes is a single-shot landscape of how the Chapter 7–8 architectural pattern applies to five domains. The same pattern branches by domain: ML (execute weight), alignment (reflection weight), biomedical (hypothesis weight), materials (KG weight), medical (co-reasoning weight), SDL (execute + approval weight). That branching is the first sanity check of whether the 4-layer × 6-level frame matches domain reality. This survey judges that the sanity check passes: the taxonomies of Chapter 2 and Chapter 3 hold up against domain reality.

Part IV shifts the view — from domain case studies to building your own research OS. Chapters 10–12 cover the roadmap of starting an LLM Wiki with Claude Code/Codex + Obsidian and incrementally building an AI Scientist environment. The domain landscape of Chapter 9 becomes the destination of that roadmap.

References

[1] Adam, D. (2026). The AI co-scientist is here (Nature Medicine feature). Nature Medicine. DOI:10.1038/s41591-026-04275-z. #11
[2] Anthropic (2026). Automated Alignment Researchers — Using LLMs to scale scalable oversight. Anthropic Research, 2026-04-14. #28
[3] Boiko, D. A., MacKnight, R., & Gomes, G. (2023). Emergent autonomous scientific research capabilities of large language models. arXiv:2304.05332; Nature 2023. DOI:10.1038/s41586-023-06792-0.
[4] Bowman, S. R., Hyun, J., Perez, E., Chen, E., Pettit, C., Heiner, S., et al. (2022). Measuring Progress on Scalable Oversight for Large Language Models. arXiv:2211.03540.
[5] Bran, A. M., Cox, S., Schilter, O., Baldassari, C., White, A. D., & Schwaller, P. (2023). ChemCrow: Augmenting large-language models with chemistry tools. arXiv:2304.05376; Nature Machine Intelligence 2024.
[6] Brazil, R. (2026). Inside the self-driving lab revolution. Nature feature. #31
[7] Burns, C., Izmailov, P., Kirchner, J. H., Baker, B., Gao, L., Aschenbrenner, L., et al. (2023). Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision. arXiv:2312.09390; ICML 2024.
[8] Clark, J. (2026). Import AI 454: Automating alignment research. Import AI Substack.
[9] Ghafarollahi, A., & Buehler, M. J. (2024). SciAgents: Automating scientific discovery through multi-agent intelligent graph reasoning. arXiv:2409.05556; Advanced Materials 2025. DOI:10.1002/adma.202413523.
[10] Gottweis, J. et al. (2025). Towards an AI co-scientist. arXiv:2502.18864. #11
[11] Guan et al. (2026). AI-Assisted Drug Re-Purposing for Human Liver Fibrosis. Advanced Science.
[12] HIMS (2026). RoboChem Flex: democratisation of the autonomous synthesis robot. HIMS, University of Amsterdam, 2026. #31
[13] Karpathy, A. (2026). karpathy/autoresearch. GitHub. #30
[14] King, R. D., Rowland, J., Oliver, S. G., Young, M., Aubrey, W., Byrne, E., Liakata, M., Markham, M., Pir, P., Soldatova, L. N., Sparkes, A., Whelan, K. E., & Clare, A. (2009). The Automation of Science. Science 324(5923):85–89. DOI:10.1126/science.1165620.
[15] Nature News (2026). How to build an AI scientist: first peer-reviewed paper spills the secrets. Nature.
[16] Phys.org (2026). Low-cost robotic chemistry system can be built and deployed in any lab. Phys.org, 2026-04.
[17] Pilon, S. et al., Noël, T. (2026). A flexible and affordable self-driving laboratory for automated reaction optimization (RoboChem-Flex). Nature Synthesis. DOI:10.1038/s44160-026-01053-0. #31
[18] PsyPost (2026). Google's AI co-scientist just solved a biological mystery that took humans a decade. PsyPost. #11
[19] QPillars (2026). Self-Driving Labs in 2026 — What Actually Works vs. What's Still Hype. QPillars blog. #31
[20] Schmidgall et al. (2025). Evaluating Sakana's AI Scientist for Autonomous Research. arXiv:2502.14297.
[21] Stanford (2025). Paper2Agent: Reimagining Research Papers As Interactive and Reliable AI Agents. arXiv:2509.06917.
[22] Sun, J. et al. (2023). Think-on-Graph: Deep and Responsible Reasoning of Large Language Model on Knowledge Graph. arXiv:2307.07697; ICLR 2024.
[23] The New Stack (2026). Andrej Karpathy's 630-line Python script ran 50 experiments overnight without any human input. The New Stack.
[24] Um, T. (2026a). Medical AI Scientist summary + analysis. terryum.ai paper post. #21
[25] Um, T. (2026b). Self-Driving Labs summary + analysis. terryum.ai paper post. #31
[26] Wu, H., Zheng, B., Song, D., Jiang, Y., Gao, J., Xing, L., Sun, L., & Yuan, Y. (2026). Towards a Medical AI Scientist. arXiv:2603.28589. #21