Part III: From Paper-to-Agent to AI Scientist (The Evolution of Autonomous Discovery)

Chapter 9: Field by Field in Practice (ML · alignment · biomedical · materials · medical)

Written: 2026-05-22; last modified: 2026-05-22

9.1 Intent of This Chapter: A Field Landscape, Not Field Case Studies

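This chapter is deliberately short, and it deliberately avoids deep dives. As stated in the "won't claim" list in positioning.md §7, this survey is not a guide to wet-lab AI Scientists. The critical-analyst recommendation points the same way: Ch9 should remain a preview-tier landscape with strong pointers, and deep per-field coverage belongs in a separate follow-up survey one to two years from now. The reason is G9 (the gap identified in Chapter 3). Within this survey's corpus, the wet-lab tier (L6) rests on thin coverage of only about 5–6 primary sources, and attempting depth on that thin base would upset the balance across fields.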

A field landscape is still needed. Without seeing how far the architectural patterns of Chapters 7–8 have reached in each field, the 4-layer × 6-level frame (Chapters 2 and 3) stays abstract. This chapter covers five fields (ML, alignment, biomedical, materials, medical) plus one bundle, self-driving labs (SDL). Because Chapter 8 already covered the architectural side of ML and alignment, they are summarized briefly here as field anchors; biomedical, materials, medical, and SDL are new here, so each gets a landscape of one or two paragraphs. Section 9.8 summarizes the cross-field patterns.

Following the critical-analyst recommendation, this chapter is kept short. Field deep dives are explicitly outside this survey's scope.

9.2 ML: Karpathy autoresearch as the ML Anchor

Figure 9.1: Cross-field pattern matrix for AI Scientists, showing autonomy, evaluation, human-in-the-loop, and flagship for ML · Alignment · Biomedical · Materials · Medical (illustration by author, gpt-image assisted)

The most prominent AI Scientist case in ML is Karpathy's autoresearch, covered in Chapter 8 §8.3 ^[13]. Chapter 8 examined its architectural commitments (630-line, single-loop, 5-min experiment budget). Here we look at what it means for the field: what can an ML researcher take away from autoresearch?

Three things. First, on a well-defined small substrate, autoresearch for ML research is already feasible. nanochat is that substrate, and val_bpb plus the 5-min budget make it fair-iterable. Second, the fact that most of the keep-worthy improvements the agent found were prior-known patches shows that the value of autoresearch lies in application more than in discovery: it quickly finds and applies patches that are already known in ML but not yet present in the substrate. Third, the Shopify Liquid 53% speedup ^[23] showed that the same pattern transfers to production engineering, which is evidence that engineering-flavored applications are mature (G12; Chapter 8 §8.7).
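The keep/discard pattern described above can be sketched as a greedy loop over a single scalar metric. This is a minimal illustration, not the autoresearch code itself: `keep_or_discard_loop`, `toy_propose`, and `toy_val_bpb` are hypothetical names, the learning-rate search space is invented, and the toy metric only stands in for a val_bpb measured after one fixed-budget training run.

```python
import math
import random
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]


def keep_or_discard_loop(
    baseline: Config,
    propose: Callable[[Config, random.Random], Config],
    evaluate: Callable[[Config], float],
    n_experiments: int,
    seed: int = 0,
) -> Tuple[Config, float, List[Tuple[float, bool]]]:
    """Greedy single loop: keep a candidate only if it lowers the metric."""
    rng = random.Random(seed)
    best = dict(baseline)
    best_score = evaluate(best)
    history: List[Tuple[float, bool]] = []
    for _ in range(n_experiments):
        candidate = propose(best, rng)
        # In a real setup this call would be one fixed-budget training run.
        score = evaluate(candidate)
        keep = score < best_score
        if keep:
            best, best_score = candidate, score
        history.append((score, keep))
    return best, best_score, history


def toy_propose(cfg: Config, rng: random.Random) -> Config:
    """Hypothetical proposal step: rescale the learning rate."""
    candidate = dict(cfg)
    candidate["lr"] = cfg["lr"] * rng.choice([0.5, 0.8, 1.25, 2.0])
    return candidate


def toy_val_bpb(cfg: Config) -> float:
    """Invented stand-in metric, minimised at lr = 0.01 (lower is better)."""
    return 1.0 + abs(math.log10(cfg["lr"]) + 2.0)


if __name__ == "__main__":
    best, score, history = keep_or_discard_loop(
        {"lr": 0.1}, toy_propose, toy_val_bpb, n_experiments=20
    )
    kept = sum(1 for _, was_kept in history if was_kept)
    print(f"best={best} score={score:.3f} kept={kept}/{len(history)}")
```

Holding the per-experiment budget fixed is what makes the comparison `score < best_score` fair from one iteration to the next; in this sketch that fairness is simply assumed to live inside `evaluate`.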

The natural next step for ML is substrate diversification. Does the pattern that works on nanochat also work on larger training substrates? Does it work on more diverse task families (vision, RL, multimodal)? As of this survey's horizon (2026-05), these are open questions.

9.3 Alignment: Anthropic AAR as the Alignment Anchor

The AI Scientist case in alignment is Anthropic's AAR, covered in Chapter 8 §8.4 ^[2]. Chapter 8 covered, at paragraph level, the 9-instance design, PGR 0.97, the $18k cost, and the G3 caveat (failed transfer to production Sonnet 4). Here we summarize what it means for the field.

AAR shows two things for alignment. First, the scalable oversight program ^[4] has settled into a measurable system. The 2022–2023 research program produced the theoretical frame and the metric (PGR ^[7]), and AAR ran a 9-instance multi-agent system on top of them. The field has moved from asking "is scalable oversight possible?" to asking "how do we turn scalable oversight into a system?" Second, publishing honest negative results has become part of the field's discipline. That the failed transfer to production Sonnet 4 and the reward-hacking reports appeared in the body of the release post itself shows that an honest-limits posture is a community-acceptable outcome in alignment. This forms a methodological pair with Schmidgall et al.'s critique of Sakana v1 (Chapter 7 §7.4).
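For readers unfamiliar with the metric, performance gap recovered (PGR) as defined in the weak-to-strong generalization work ^[7] is the fraction of the gap between a weak supervisor and a strong ceiling that the weakly supervised strong model closes. The sketch below assumes AAR reports PGR in this same form; the function name and the example accuracies are hypothetical and chosen only so that the arithmetic lands near 0.97.

```python
def performance_gap_recovered(
    weak: float, weak_to_strong: float, strong_ceiling: float
) -> float:
    """PGR = (weak_to_strong - weak) / (strong_ceiling - weak).

    1.0 means the full gap was recovered; 0.0 means no gain over the
    weak supervisor.
    """
    gap = strong_ceiling - weak
    if gap <= 0:
        raise ValueError("strong_ceiling must exceed weak performance")
    return (weak_to_strong - weak) / gap


if __name__ == "__main__":
    # Hypothetical accuracies, not figures from the AAR release.
    pgr = performance_gap_recovered(
        weak=0.60, weak_to_strong=0.794, strong_ceiling=0.80
    )
    print(f"PGR = {pgr:.2f}")
```

Because PGR is a ratio over the weak-to-ceiling gap, a high value on one model pair says nothing by itself about another pair, which is why the production-transfer caveat matters.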

The next step for alignment is also clear: finding a method by which PGR 0.97 transfers to production models. This is an open question at this survey's horizon, and under the field's honest framing the brittleness of current models is a model-specific phenomenon, not a fundamental limit ^[8].

9.4 Biomedical: Wet-Lab Validation of Google AI Co-Scientist

Figure 9.3: Biomedical follow-ups to Google AI Co-Scientist: AML candidates · liver fibrosis organoids · early in vivo (illustration by author, gpt-image assisted)

The AI Scientist case in biomedicine is Google AI Co-Scientist, covered in Chapter 7 §7.5 ^[10]. Chapter 7 examined the architectural side (multi-agent, Elo tournament, the 5 reviewer roles of Reflection). This chapter looks at the wet-lab validation.
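As a reminder of what an Elo tournament over hypotheses involves, the sketch below applies the standard Elo update to pairwise comparisons. It is not Co-Scientist's implementation: the hypothesis identifiers, the starting rating of 1200, and the K-factor of 32 are assumptions for illustration, and the source of each win/loss judgment (for example a debate between agents) is left outside the code.

```python
from typing import Dict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def record_match(
    ratings: Dict[str, float], winner: str, loser: str, k: float = 32.0
) -> None:
    """Update ratings in place after one pairwise comparison."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


if __name__ == "__main__":
    # Hypothetical hypothesis ids, all starting from the same rating.
    ratings = {"hyp_a": 1200.0, "hyp_b": 1200.0, "hyp_c": 1200.0}
    for winner, loser in [("hyp_a", "hyp_b"), ("hyp_a", "hyp_c"), ("hyp_c", "hyp_b")]:
        record_match(ratings, winner, loser)
    for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {rating:.1f}")
```

The update is zero-sum, so the ranking reflects relative standing within the current pool of hypotheses and says nothing about absolute quality; that is one reason wet-lab follow-up remains the real test.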

We summarize three cases. First, drug repurposing for Acute Myeloid Leukemia (AML). Among the candidates Co-Scientist proposed, Binimetinib showed an IC50 of 7 nM in MOLM-13, and KIRA6 showed an IC50 of 13 nM in KG-1 (517 nM in MOLM-13) ^[10]. This is in vitro validation, but it was reported with quantitative results. Second, epigenetic target discovery for liver fibrosis. Co-Scientist proposed 3 new epigenetic targets, and all 4 drugs reduced fibroblast activation (p < 0.001 to p < 0.0001); 2 showed no cytotoxicity and 1 is an FDA-approved drug ^[10]. One year later, Guan et al. reported an independent wet-lab replication: anti-fibrotic activity for 2 epigenetic targets in organoid and animal models, plus repurposing potential for 1 FDA-approved drug ^[11]. This is evidence that Co-Scientist's hypotheses survive third-party replication. Third, antimicrobial resistance (AMR) and the cf-PICIs mechanism. Co-Scientist re-derived the chromosomal-island phage-tail mechanism in silico in 2 days ^[18], a result that took human researchers about 10 years (the popular-press phrasing is "a decade in 2 days"; strictly, the framing is limited to in silico re-derivation).

The Nature Medicine feature ^[1] lays out chronologically how these results are carrying over into organoid, animal, and early clinical follow-up. Nature News ^[15] covers the "first peer-reviewed paper" framing in the same field. The field's honest framing is that in vitro success ≠ clinical success: pharmacokinetics, drug interactions, and clinical trial design are out of scope ^[10]. Co-Scientist's biomedical results are a strong case for the hypothesis generation layer, but clinical use requires a separate validation process.

9.5 Materials: SciAgents from MIT LAMM (preview, no deep dive)

The most prominent case in materials is SciAgents by Ghafarollahi & Buehler ^[9]. It is the work of the MIT LAMM (Laboratory for Atomistic and Molecular Mechanics) group and was published in Advanced Materials in 2025. It rests on three pillars (a large-scale ontological knowledge graph, LLM+tool ensemble retrieval, and a multi-agent system with in-situ learning) and is applied to bioinspired materials discovery.

It differs from Co-Scientist in two ways. First, the knowledge graph is the primary data structure. Co-Scientist is centered on Elo-ranked hypotheses, whereas SciAgents is centered on an ontological KG, with multi-agent reasoning taking place on top of it. The KG × LLM lineage traces back to Think-on-Graph by Sun et al. ^[22] (the KG-track mentioned in Chapter 7 §7.2). Second, interdisciplinary bridges are an explicit contribution: for example, bone microstructure → composite materials, surfacing connections between fields that are usually treated as unrelated.
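To make "the KG as primary data structure" concrete, the sketch below finds a path between two concepts in a tiny graph; a path of this kind is what downstream agents would then reason over. Everything here is illustrative: the intermediate nodes are placeholders, not entries from the SciAgents ontology, and breadth-first shortest-path search is a simple stand-in, not a claim about the path-sampling strategy SciAgents actually uses.

```python
from collections import deque
from typing import Dict, List, Optional

Graph = Dict[str, List[str]]


def find_concept_path(graph: Graph, source: str, target: str) -> Optional[List[str]]:
    """Breadth-first search for a shortest path between two concepts."""
    if source not in graph or target not in graph:
        return None
    parents: Dict[str, Optional[str]] = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk parent links back to the source.
            path: List[str] = []
            current: Optional[str] = node
            while current is not None:
                path.append(current)
                current = parents[current]
            return path[::-1]
        for neighbour in graph.get(node, []):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None


# Toy undirected graph; intermediate concepts are placeholders.
TOY_KG: Graph = {
    "bone microstructure": ["placeholder_concept_1", "placeholder_concept_3"],
    "placeholder_concept_1": ["bone microstructure", "placeholder_concept_2"],
    "placeholder_concept_2": ["placeholder_concept_1", "placeholder_concept_4", "composite materials"],
    "placeholder_concept_3": ["bone microstructure", "placeholder_concept_4"],
    "placeholder_concept_4": ["placeholder_concept_3", "placeholder_concept_2"],
    "composite materials": ["placeholder_concept_2"],
}

if __name__ == "__main__":
    path = find_concept_path(TOY_KG, "bone microstructure", "composite materials")
    print(" -> ".join(path) if path else "no path")
```

The design point is that the hypothesis is grounded in an explicit chain of graph edges, so each step of an interdisciplinary bridge can be inspected, in contrast to a ranking-centered design where the hypothesis text itself is the primary object.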

This survey treats SciAgents as the anchor case for KG-track AI Scientists, not as a deep dive into materials. The field's next steps (wet-lab validation, transfer to other materials families) are outside this survey's scope. Code is available at lamm-mit/SciAgentsDiscovery.

9.6 Medical: Wu et al. Medical AI Scientist + Adam Nature Medicine Pointer

The most prominent case in medicine is the Medical AI Scientist by Wu et al. ^[26], a March 2026 work from the CUHK-AIM Group, Stanford, and Microsoft Research. Its core novelty relative to a generic AI Scientist is a clinician-engineer co-reasoning mechanism, which converts surveyed literature into actionable evidence while improving the traceability of generated ideas. It operates in three research modes (paper-based reproduction, literature-inspired innovation, task-driven exploration), each corresponding to a different autonomy level. The evaluation covered 171 cases × 19 clinical tasks × 6 data modalities, using double-blind expert review plus the Stanford Agentic Reviewer, and manuscript quality reached MICCAI level (consistently above the ISBI and BIBM thresholds).

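This survey's position is clear. MICCAI-level alignment was measured by reviewer evaluation; it is not acceptance at the actual venue. The framing is similar to the Sakana v2 ICLR workshop acceptance covered in Chapter 7 §7.4: the headline "AI passed a venue" more precisely means "reviewer scores measured against the venue's threshold". This does not diminish Wu et al.'s results. It is simply a matter of seeing exactly which metric the field is using.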

Adam's Nature Medicine feature ^[1] traces chronologically the shift in medical AI from chat-style biomedical LLMs to hypothesis-generating co-scientists. The field's next steps (clinical use, regulatory validation, accountability) belong to medicine's own separate processes. This survey treats the medical AI Scientist as a landscape pointer and leaves what lies beyond to a dedicated survey of medical AI.

9.7 Self-Driving Labs (SDL): Preview Only

Figure 9.2: Self-Driving Lab: robotic chemistry workflows of King's Adam (2009) and Pilon's RoboChem-Flex (2026) (illustration by author, gpt-image assisted)

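This section is the most explicitly preview-tier. Exactly as G9 notes, wet-lab corpus density within this survey is insufficient. The section therefore sketches the 17-year SDL lineage only briefly, mainly to establish that such a lineage exists at all.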

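The lineage runs as follows. The starting point is King et al. 2009, Adam ^[14]. An autonomous wet-lab robot from the Aberystwyth group, it autonomously generates hypotheses about yeast gene function and then designs, runs, and analyzes the experiments. Of 20 new hypotheses, 12 were manually verified as true. It ran continuous wet-lab cycles for about a year. It is pre-LLM work (symbolic reasoning plus a hand-curated metabolic-pathway knowledge base) and serves as this survey's pre-LLM L6 ancestor. The reason for citing Adam is clear: wet-lab autonomy is a 17-year program dating from 2009, and the LLM-driven systems of 2026 are the latest cross-section of that program, not a sudden discovery.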

How did the lineage continue? Boiko et al. 2023, Coscientist ^[3], used GPT-4 with an Opentrons OT-2 liquid handler to autonomously carry out Suzuki/Sonogashira cross-coupling syntheses, making it the first LLM-driven wet-lab AI Scientist after Adam (also mentioned in Chapter 7 §7.3 as an architectural ancestor predating v1). In the same year, EPFL's ChemCrow ^[5] performed synthesis planning and chromophore design on the dry-lab side with 18 chemistry tools. The two papers form a dry-lab/wet-lab pair published two weeks apart.

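Then 2026. Pilon et al.'s RoboChem-Flex ^[17] is a modular self-driving lab costing about $5,000, from the Noël Research Group (University of Amsterdam, HIMS). It offers Python-based control, Bayesian optimization (including multi-objective and transfer learning), and support for both autonomous closed-loop and human-in-the-loop modes. It reports 6 case studies (photocatalysis, biocatalysis, thermal cross-coupling, asymmetric catalysis, and others). Its distinguishing feature is democratization: it gives resource-limited groups access to autonomous experimentation ^[12]. Brazil's Nature feature ^[6] summarizes the SDL landscape as "AI sets the strategy, robots execute, and results feed back into the next experiment". The core message is "chemistry yes, materials informatics narrow, biology still hype".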
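The difference between the autonomous closed-loop mode and the human-in-the-loop mode can be sketched as one loop with an optional approval gate. This is a schematic under stated assumptions, not RoboChem-Flex's control code: `closed_loop`, `toy_yield`, and the `approve` callback are hypothetical, the single normalized parameter is invented, and the proposal step is a random local perturbation standing in for the Bayesian optimizer a real system would use.

```python
import random
from typing import Callable, List, Optional, Tuple


def closed_loop(
    run_experiment: Callable[[float], float],
    n_rounds: int,
    approve: Optional[Callable[[float], bool]] = None,
    seed: int = 0,
    step: float = 0.1,
) -> Tuple[float, float, List[Tuple[float, float]]]:
    """Propose -> (optional human approval) -> execute -> feed back."""
    rng = random.Random(seed)
    best_x = 0.5  # arbitrary starting condition on a normalized [0, 1] axis
    best_y = run_experiment(best_x)
    log: List[Tuple[float, float]] = [(best_x, best_y)]
    for _ in range(n_rounds):
        # Stand-in for a Bayesian-optimization acquisition step.
        proposal = min(1.0, max(0.0, best_x + rng.uniform(-step, step)))
        if approve is not None and not approve(proposal):
            continue  # rejected at the approval gate; nothing is executed
        outcome = run_experiment(proposal)
        log.append((proposal, outcome))
        if outcome > best_y:
            best_x, best_y = proposal, outcome  # result feeds the next round
    return best_x, best_y, log


def toy_yield(x: float) -> float:
    """Invented response surface with its optimum at x = 0.7."""
    return 1.0 - (x - 0.7) ** 2


if __name__ == "__main__":
    x_auto, y_auto, _ = closed_loop(toy_yield, n_rounds=30)
    x_gate, y_gate, _ = closed_loop(toy_yield, n_rounds=30, approve=lambda x: x <= 0.6)
    print(f"autonomous: x={x_auto:.3f} yield={y_auto:.4f}")
    print(f"gated:      x={x_gate:.3f} yield={y_gate:.4f}")
```

Passing `approve=None` gives the fully autonomous mode; supplying a callback turns the same loop into an approval-gated one, which is the human-in-the-loop mode this chapter later assigns to SDL in the cross-field table.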

QPillars' counter-take ^[19] should be read alongside it: grading readiness by domain, it judges that hype dominates in biology. The field's honest framing is that chemistry works, materials informatics works narrowly, and biology is still immature.

This survey covers SDL briefly, for two reasons. First, the primary sources (Pilon 2026, Brazil 2026 Nature, Guan 2026 replication, QPillars 2026 counter-take, Boiko 2023, King 2009) number only about 5–6, and attempting deep coverage on this corpus would produce a framing out of proportion to corpus density. Second, this corpus contains no primary sources on robotics-domain AI Scientists: robotics is a separate field, and this survey leaves it to separate surveys such as those on tactile sensing. This survey's position is that once wet-lab AI Scientist case studies accumulate over the next one to two years, they should be covered in a separate follow-up survey.

9.8 Across Fields: How the Same Pattern Applies in Different Domains

Looking across the five fields and the SDL bundle, the common pattern extracted in Chapter 7 §7.8 (literature → gap → hypothesis → design → execute → reflect) turns out to operate with different weights in each field.

Field	dominant pattern slot	dominant human-in-the-loop mode
ML (autoresearch)	execute (5-min experiment + keep/discard)	last-mile correction
Alignment (AAR)	forum-shared reflection	approval-gate + last-mile correction
Biomedical (Co-Scientist)	hypothesis + Elo ranking	co-reasoning + wet-lab follow-up
Materials (SciAgents)	KG-grounded hypothesis	co-reasoning
Medical (Wu et al.)	clinician-engineer co-reasoning	co-reasoning
SDL (RoboChem-Flex)	execute (closed-loop + BO)	approval-gate

Where a field places its weight among the same architectural primitives is one measure of that field's maturity. In ML and SDL the execute slot is dominant, because the experiments themselves are fast to iterate. In biomedical, materials, and medical the hypothesis and co-reasoning slots are dominant: experiments are expensive, so the contest is decided on hypothesis quality. In alignment the reflection slot is dominant, because sharing with other AARs is key to PGR recovery.

This observation connects back to the 6-level maturity model of Chapter 3. L5 (dry-lab AI Scientist) contains ML, alignment, and dry-lab hypothesis generation for biomedical and materials. L6 (wet-lab / robot AI Scientist) contains SDL and biomedical wet-lab follow-up. At this survey's horizon (2026-05), L5 is dense and L6 is thin, which is the corpus density difference that G9 identifies.

9.9 Closing: What the Field Landscape Shows

This chapter covered five fields and SDL at preview tier. The lack of depth is intentional. This survey's scope is the cleanest map of 4-layer × 6-level (positioning.md §1), not per-field deep dives. Those lie beyond this survey's horizon: biomedical belongs to medical AI surveys, materials to materials informatics surveys, and robotics to robotics surveys.

Still, this chapter's contribution is a single-shot landscape of how the architectural patterns of Chapters 7–8 have been applied across the five fields. That the same pattern branches into ML (execute weight), alignment (reflection weight), biomedical (hypothesis weight), materials (KG weight), medical (co-reasoning weight), and SDL (execute + approval weight) is a first sanity check on whether the 4-layer × 6-level frame matches per-field reality. This survey judges that the sanity check passes: the taxonomy of Chapters 2 and 3 holds up against field reality.

Part IV shifts the view, from field case studies to building your own research OS. Chapters 10–12 cover starting an LLM Wiki with Claude Code/Codex + Obsidian and a roadmap for building your own AI Scientist environment step by step. The field landscape seen in Chapter 9 becomes the destination of that roadmap.

References

Adam, D. (2026). The AI co-scientist is here (Nature Medicine feature). Nature Medicine. DOI:10.1038/s41591-026-04275-z. #11
Anthropic (2026). Automated Alignment Researchers — Using LLMs to scale scalable oversight. Anthropic Research, 2026-04-14. #28
Boiko, D. A., MacKnight, R., & Gomes, G. (2023). Emergent autonomous scientific research capabilities of large language models. arXiv:2304.05332; Nature 2023. DOI:10.1038/s41586-023-06792-0.
Bowman, S. R., Hyun, J., Perez, E., Chen, E., Pettit, C., Heiner, S., et al. (2022). Measuring Progress on Scalable Oversight for Large Language Models. arXiv:2211.03540.
Bran, A. M., Cox, S., Schilter, O., Baldassari, C., White, A. D., & Schwaller, P. (2023). ChemCrow: Augmenting large-language models with chemistry tools. arXiv:2304.05376; Nature Machine Intelligence 2024.
Brazil, R. (2026). Inside the self-driving lab revolution. Nature feature. #31
Burns, C., Izmailov, P., Kirchner, J. H., Baker, B., Gao, L., Aschenbrenner, L., et al. (2023). Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision. arXiv:2312.09390; ICML 2024.
Clark, J. (2026). Import AI 454: Automating alignment research. Import AI Substack.
Ghafarollahi, A., & Buehler, M. J. (2024). SciAgents: Automating scientific discovery through multi-agent intelligent graph reasoning. arXiv:2409.05556; Advanced Materials 2025. DOI:10.1002/adma.202413523.
Gottweis, J. et al. (2025). Towards an AI co-scientist. arXiv:2502.18864. #11
Guan et al. (2026). AI-Assisted Drug Re-Purposing for Human Liver Fibrosis. Advanced Science.
HIMS (2026). RoboChem Flex: democratisation of the autonomous synthesis robot. HIMS, University of Amsterdam, 2026. #31
Karpathy, A. (2026). karpathy/autoresearch. GitHub. #30
King, R. D., Rowland, J., Oliver, S. G., Young, M., Aubrey, W., Byrne, E., Liakata, M., Markham, M., Pir, P., Soldatova, L. N., Sparkes, A., Whelan, K. E., & Clare, A. (2009). The Automation of Science. Science 324(5923):85–89. DOI:10.1126/science.1165620.
Nature News (2026). How to build an AI scientist: first peer-reviewed paper spills the secrets. Nature.
Phys.org (2026). Low-cost robotic chemistry system can be built and deployed in any lab. Phys.org, 2026-04.
Pilon, S. et al., Noël, T. (2026). A flexible and affordable self-driving laboratory for automated reaction optimization (RoboChem-Flex). Nature Synthesis. DOI:10.1038/s44160-026-01053-0. #31
PsyPost (2026). Google's AI co-scientist just solved a biological mystery that took humans a decade. PsyPost. #11
QPillars (2026). Self-Driving Labs in 2026 — What Actually Works vs. What's Still Hype. QPillars blog. #31
Schmidgall et al. (2025). Evaluating Sakana's AI Scientist for Autonomous Research. arXiv:2502.14297.
Stanford (2025). Paper2Agent: Reimagining Research Papers As Interactive and Reliable AI Agents. arXiv:2509.06917.
Sun, J. et al. (2023). Think-on-Graph: Deep and Responsible Reasoning of Large Language Model on Knowledge Graph. arXiv:2307.07697; ICLR 2024.
The New Stack (2026). Andrej Karpathy's 630-line Python script ran 50 experiments overnight without any human input. The New Stack.
Um, T. (2026a). Medical AI Scientist 요약 + 분석. terryum.ai paper post. #21
Um, T. (2026b). Self-Driving Labs 요약 + 분석. terryum.ai paper post. #31
Wu, H., Zheng, B., Song, D., Jiang, Y., Gao, J., Xing, L., Sun, L., & Yuan, Y. (2026). Towards a Medical AI Scientist. arXiv:2603.28589. #21