Part III: From Paper-to-Agent to AI Scientist, the Evolution of Autonomous Discovery

Chapter 7: The AI Scientist Lineage, from Sakana 2408 to AI Co-Scientist

Written: 2026-05-22 Last modified: 2026-05-22

7.1 Why Bundle a 21-Month Lineage into One Chapter

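This chapter does one job. It takes Sakana AI's The AI Scientist v1, posted to arXiv on August 12, 2024, as its starting point and lines up the autonomous research systems released over the following 21 months, through May 2026, as a single lineage. The reason this book's tail tracking starts from that paper is clear. Before Sakana v1, Boiko et al.'s Coscientist ^[3] and EPFL's ChemCrow ^[4] had already demonstrated wet-lab and domain-specific tool use, but Sakana was the first to formalize, in a published paper, a system that runs "idea generation → code → experiments → visualization → paper writing → simulated review" end to end ^[13].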

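The trouble came afterward. In February 2025 Google released AI Co-Scientist ^[7]; in April Sakana reported that a v2-generated paper had cleared review at an ICLR 2025 workshop ^[24]; and in May HKUDS released AI-Researcher, which was accepted to NeurIPS 2025 ^[9]. In April 2026 Anthropic's AAR ^[1] and Karpathy's autoresearch ^[11] were announced within the same week, and in the same month Zhang Xiangyue released Deep Researcher Agent, a framework for 24-hour autonomous ML experimentation ^[28]. With systems accumulating over roughly 21 months, "AI Scientist" is no longer the name of a single system but the name of a family.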

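Viewed as a lineage, a contradiction becomes visible. Popular media often place Sakana v1's simulated-NeurIPS scores, Co-Scientist's "ten years of research in two days" headline ^[15], and AAR's PGR of 0.97 versus 0.23 for humans ^[1] on the same scale and bundle them as "AI Scientists have surpassed humans." But each system actually measures something different. Schmidgall et al. showed that because Sakana v1's novelty assessment relies on keyword search, it classifies even well-known concepts such as micro-batch SGD as "novel" ^[18], and Nature's follow-up coverage of v2 also recorded that reviewers criticized the one paper that cleared the ICLR workshop for "insufficient novelty, weak citations, and a generic experiments section" ^[14]. This chapter explicitly breaks the popular-press frame that puts incomparable metrics on one axis.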

This chapter is organized as follows. Section 7.2 reviews the structural ancestors that predate Sakana (the Princeton group's ReAct, Reflexion, and ToT) to make clear that v1 was an integration, not an invention. Section 7.3 covers Sakana v1's end-to-end pipeline and its limits; 7.4 covers v2's agentic tree search and Experiment Manager, and shows why the critique by Schmidgall et al. is a model of the "honest negative result publishing" discipline. Section 7.5 covers Google AI Co-Scientist's multi-agent design, Elo tournament, and wet-lab validation; 7.6 covers the hierarchical decomposition in HKUDS AI-Researcher; and 7.7 covers Zhang's Deep Researcher Agent and the Agentic Researcher's five-level autonomy taxonomy. Section 7.8 extracts the common patterns (multi-agent, tool use, sandbox, memory, reflection, reviewer agent), and 7.9 explicitly lays out the "evaluation lags execution" problem and the incomparability of metrics.

7.2 Structural Ancestors: Three Primitives from the Princeton Group

Figure 7.1: ReAct · Reflexion · Tree of Thoughts, the Princeton group's three reasoning primitives. Illustration by author (gpt-image assisted)

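Before Sakana v1 appeared, Shunyu Yao and colleagues at Princeton released, over 2022–2023, three primitives that would become the skeleton of autonomous research systems. First, ReAct interleaves "Thought → Action → Observation" tokens in a single decoding stream so that reasoning is grounded in observations of the environment ^[27]. It reported absolute success-rate gains of 34 percentage points on ALFWorld and 10 on WebShop, and nearly every agentic system since (Auto-GPT, AutoGen, Reflexion, Sakana) is a variation on this pattern. Because ReAct binds reasoning and acting into one trajectory, it is the smallest unit of a closed-loop AI Scientist.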
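The Thought → Action → Observation loop can be sketched in a few lines. This is a minimal illustration, not the ReAct authors' implementation: `toy_policy` is a hypothetical stand-in for the LLM, and the `lookup` tool and its one-entry table are invented for the example.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str]  # (kind, text), kind in {"thought", "action", "observation"}

# Hypothetical toy tool; a real agent would call search, a shell, an API, etc.
TOOLS: Dict[str, Callable[[str], str]] = {
    "lookup": lambda key: {"capital_of_france": "Paris"}.get(key, "unknown"),
}


def toy_policy(question: str, trajectory: List[Step]) -> Tuple[str, str, str]:
    """Stand-in for the LLM: returns (thought, action_name, action_argument)."""
    observations = [text for kind, text in trajectory if kind == "observation"]
    if not observations:
        return ("I have no evidence yet, so I should look this up.", "lookup", question)
    return ("The last observation answers the question.", "finish", observations[-1])


def react_loop(question: str, max_steps: int = 5) -> Tuple[str, List[Step]]:
    """Interleave reasoning and acting in one trajectory until 'finish'."""
    trajectory: List[Step] = []
    for _ in range(max_steps):
        thought, action, arg = toy_policy(question, trajectory)
        trajectory.append(("thought", thought))
        trajectory.append(("action", f"{action}[{arg}]"))
        if action == "finish":
            return arg, trajectory
        # Grounding step: the tool result is fed back into the same trajectory.
        trajectory.append(("observation", TOOLS[action](arg)))
    return "no answer", trajectory


answer, trace = react_loop("capital_of_france")
```

The point of the sketch is that the policy sees its own earlier thoughts, actions, and observations on every step, which is what grounds the reasoning in the environment.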

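Second, Reflexion layers a verbal RL loop on top of ReAct ^[20]. When a failed trajectory ends, an evaluator assigns a reward, and a Self-Reflection LLM writes a natural-language post-mortem that is stored in episodic memory. The next attempt starts with this memory inserted into the prompt. Only the prompt grows and no weights are updated, so the approach is cheap, but what matters more is that self-critique is itself an act of research. Raising HumanEval pass@1 from 80% to 91% ^[20] was the first quantitative evidence that "reflection works," and Sakana v1's simulated reviewer agent, Anthropic AAR's forum-scratch shared memory, and Deep Researcher Agent's Reflect stage are all direct descendants of this pattern.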
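The act, evaluate, reflect, retry loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: the actor, evaluator, and reflector are hypothetical toy functions standing in for LLM calls, and the off-by-one task is invented for the example.

```python
from typing import Callable, List, Tuple


def reflexion_loop(
    actor: Callable[[str, List[str]], str],
    evaluator: Callable[[str], bool],
    reflector: Callable[[str], str],
    task: str,
    max_trials: int = 3,
) -> Tuple[str, List[str]]:
    """Retry a task, carrying verbal post-mortems forward in episodic memory."""
    memory: List[str] = []  # grows as text; no model weights are updated
    attempt = ""
    for _ in range(max_trials):
        attempt = actor(task, memory)  # memory is injected into the "prompt"
        if evaluator(attempt):
            break
        memory.append(reflector(attempt))  # natural-language post-mortem
    return attempt, memory


# Hypothetical toy components.
def toy_actor(task: str, memory: List[str]) -> str:
    if any("off by one" in note for note in memory):
        return "range(1, n + 1)"
    return "range(1, n)"


def toy_evaluator(attempt: str) -> bool:
    return attempt == "range(1, n + 1)"


def toy_reflector(attempt: str) -> str:
    return f"Attempt {attempt!r} failed: off by one at the upper bound."


final_attempt, notes = reflexion_loop(toy_actor, toy_evaluator, toy_reflector, "iterate 1..n")
```

Here the second trial succeeds only because the first trial's post-mortem is present in memory, which is the mechanism the paragraph describes.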

Third, Tree of Thoughts (ToT) generalizes chain-of-thought into tree search ^[26]. Each partial solution is a node; the LLM proposes child nodes, evaluates them itself with a value prompt, and prunes with BFS or DFS. On Game of 24, GPT-4 with CoT solved 4% of problems while ToT reached 74%, roughly an 18-fold improvement ^[26]. Sakana v2's BFTS (Best-First Tree Search) and AI Co-Scientist's Elo tournament are domain-specific versions of the same "search over LM thoughts" primitive.
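The propose, evaluate, prune cycle can be illustrated with a breadth-first search that keeps only the best few nodes per level. This is a toy sketch, not the ToT authors' code: the arithmetic puzzle, the two moves, and the distance-based `value` function are all assumptions made for the example, whereas in ToT both proposal and valuation are LLM prompts.

```python
from typing import Callable, Dict, List, Optional, Tuple

Node = Tuple[int, List[str]]  # (current number, moves taken so far)

# Hypothetical "thought proposals": each move produces a child node.
MOVES: Dict[str, Callable[[int], int]] = {
    "+3": lambda x: x + 3,
    "*2": lambda x: x * 2,
}


def value(x: int, target: int) -> int:
    """Toy stand-in for the LLM value prompt: closer to the target is better."""
    return -abs(target - x)


def tot_bfs(start: int, target: int, beam_width: int = 2, max_depth: int = 4) -> Optional[List[str]]:
    """Breadth-first search over partial solutions with value-based pruning."""
    frontier: List[Node] = [(start, [])]
    for _ in range(max_depth):
        candidates: List[Node] = []
        for x, path in frontier:
            for name, fn in MOVES.items():
                child = (fn(x), path + [name])
                if child[0] == target:
                    return child[1]
                candidates.append(child)
        # Prune: keep only the highest-valued nodes for the next level.
        candidates.sort(key=lambda node: value(node[0], target), reverse=True)
        frontier = candidates[:beam_width]
    return None


solution = tot_bfs(start=1, target=11)
```

With these toy settings the search reaches 11 from 1 via the path `+3`, `*2`, `+3`, discarding lower-valued branches at each level.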

Alongside these three primitives ran one more strand: tool use. Toolformer showed that an LLM can learn where to place API calls through self-supervision ^[17], and HuggingGPT demonstrated the pattern of invoking a model zoo through a natural-language router ^[19]. Both papers are architectural ancestors of Anthropic MCP and Stanford Paper2Agent. In the KG×LLM area, Think-on-Graph defined the LLM as a beam-search agent over a knowledge graph ^[21], which leads directly to MIT LAMM's SciAgents (the materials AI Scientist covered in Chapter 9).

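Placing Sakana v1 on this grid makes things clear. v1 is the first integrated system to combine ReAct + Reflexion + ToT + tool use and apply them to an ML research substrate; it did not invent a new primitive. That does not diminish v1's value. The synthesis itself was non-trivial, and it is a historical fact that the synthesis became feasible in August 2024.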

7.3 Sakana v1: The First Formalization of an End-to-End Pipeline

Figure 7.2: The AI Scientist lineage: Sakana v1 2024-08 → v2 2025-02 → Google Co-Scientist 2025-02 → Anthropic AAR 2026-04 → Karpathy autoresearch 2026-03. Illustration by author (gpt-image assisted)

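Lu et al. released The AI Scientist on August 12, 2024 ^[13]. The system produces one ML research paper through a six-stage pipeline: (1) idea generation, (2) related-work search, (3) Aider-based code writing (on top of Claude/GPT-4), (4) experiment execution, (5) result visualization, and (6) paper writing plus a simulated reviewer agent. It was validated on three substrates: neural-network-based diffusion modeling, transformer architectures, and grokking. Two numbers are central. A NeurIPS-style draft was generated at a cost of about $15 per paper, and the simulated reviewer scored some of the generated papers at a level comparable to the NeurIPS acceptance threshold ^[13].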

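These two numbers need careful reading. "$15 per paper" is appealing, but the substrate is three kinds of small ML experiments, and the simulated reviewer is not an actual NeurIPS reviewer but a prompt instance from the same LLM family. The result is "comparable to an internal threshold," not "accepted at NeurIPS." Lu et al. themselves state this limitation, and their limitations section notes the absence of wet-lab validation and the keyword-matching basis of the novelty assessment ^[13]. Even so, the popular press bundled it under the headline "AI writes NeurIPS papers," which is why the critique by Schmidgall et al., covered in 7.4, appeared.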

v1's architectural significance lies elsewhere. Earlier autonomous systems (Auto-GPT ^[25], AgentVerse ^[5], MetaGPT ^[10], CAMEL ^[12], AutoGen ^[23]) were general multi-agent coding and conversation frameworks. Boiko et al.'s Coscientist was specialized for the wet-lab domain ^[3] and ChemCrow for chemistry tool routing ^[4]. Sakana v1 was the first to bundle "a closed loop whose output is the research artifact (the paper) itself" into a single system. Every subsequent system takes v1 as its reference frame, whether it extends the same substrate (v2, AI-Researcher), moves to another domain (Co-Scientist in biology, Medical AI Scientist in clinical research), or specializes in a narrower task (autoresearch for ML training optimization, AAR for alignment research).

7.4 Sakana v2 and Schmidgall et al.: The Discipline of Honest Negative Result Publishing

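Eight months later, in April 2025, Yamada et al. released The AI Scientist-v2 ^[24]. v2 makes two clear structural changes. First, it removes the human-authored code templates that v1 depended on and generalizes across ML domains. Second, it replaces v1's linear pipeline with Best-First Tree Search (BFTS), with a separate Experiment Manager agent managing exploration of the research space. Its most visible experiment: three AI-generated papers were submitted to the ICLR 2025 "I Can't Believe It's Not Better" workshop, and the review scores of one of them exceeded the workshop's acceptance threshold for human submissions ^[24]. The code and artifacts are public on GitHub ^[16].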
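The difference between a linear pipeline and best-first tree search can be made concrete with a generic sketch. This is textbook best-first search over a priority queue, not Sakana's implementation: the integer "experiment configurations," the `refine` step, and the `score` function are hypothetical, whereas v2 would score nodes from actual experiment results.

```python
import heapq
from typing import List, Set, Tuple


def score(config: int) -> int:
    """Hypothetical experiment score; the toy optimum is at config == 7."""
    return -(config - 7) ** 2


def refine(config: int) -> List[int]:
    """Hypothetical refinements proposed from an existing experiment node."""
    return [config + 1, config + 3]


def best_first_search(root: int, budget: int) -> Tuple[int, int]:
    """Always expand the most promising unexpanded node, up to `budget` expansions."""
    # heapq is a min-heap, so scores are negated to pop the best node first.
    heap: List[Tuple[int, int]] = [(-score(root), root)]
    seen: Set[int] = {root}
    best = root
    for _ in range(budget):
        if not heap:
            break
        _, node = heapq.heappop(heap)
        for child in refine(node):
            if child in seen:
                continue
            seen.add(child)
            heapq.heappush(heap, (-score(child), child))
            if score(child) > score(best):
                best = child
    return best, score(best)


best_config, best_score = best_first_search(root=0, budget=4)
```

Unlike a linear pipeline, the search can return to any earlier node whose score makes it the best remaining candidate, which is the role the paragraph assigns to the Experiment Manager.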

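This was covered under the headline "the first case of AI passing peer review," but Nature's follow-up coverage also recorded the other side ^[14]. A workshop venue has a lower acceptance threshold than the main track, a one-in-three acceptance rate is small-n, and the actual reviewers' qualitative comments pointed to "systematic weaknesses: lack of context, citation problems, and a generic experiments section." These caveats already appear in the limitations section of the v2 paper itself ^[24].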

Then come Schmidgall et al. In February 2025, their independent evaluation of v1 was posted to arXiv ^[18]. Three findings stand out. First, v1's literature review relies on keyword search and cannot perform deep synthesis, so its novelty assessment is weak. Second, some ideas that v1 classified as "novel" are already well known; the representative case is labeling micro-batch SGD as "novel." Third, the novelty, feasibility, and interestingness scores of seed ideas have no measurable effect on system behavior. In one sentence: v1's simulated reviewer assesses simulated novelty in a simulated way.

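This survey does not treat Schmidgall et al. as mere criticism. It is a model of honest negative result publishing. When a field is immature, what it needs most is the discipline of not overstating one's own results, and Schmidgall et al. functioned as orthodox academic critique in which peers enforce that discipline from outside. Anthropic showed the same discipline itself in the AAR write-up (Chapter 8 §8.4) by disclosing the Sonnet-4 transfer failure and reward hacking at paragraph level. The two cases show that a field-wide discipline of publishing negatives is the trust infrastructure of the AI Scientist family. That is why v2's positive result at the ICLR workshop and the negative evaluation by Schmidgall et al. must sit side by side in the same chapter.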

7.5 Google AI Co-Scientist: Multi-Agent Design, Elo Tournament, Wet-Lab Validation

On February 19, 2025, Gottweis et al. released Towards an AI co-scientist ^[7]. The most obvious differences from Sakana v1 are domain (ML → biomedical hypothesis generation) and architecture (linear pipeline → multi-agent). Seven kinds of agents run asynchronously on top of Gemini 2.0: Generation (literature search, simulated debate, assumption identification), Reflection (five reviewer roles: initial, full, deep-verification, observation, simulation), Ranking (an Elo-based pairwise tournament in which top hypotheses go through multi-turn debate), Proximity (deduplication clustering), Evolution (eight generative refinement strategies), Meta-review (cross-cutting review patterns), and Supervisor (orchestrator). Every hypothesis starts at Elo = 1200, and high-Elo hypotheses accumulate as test-time compute increases.
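An Elo-based pairwise tournament of this kind can be sketched with the standard Elo update. Only the initial rating of 1200 comes from the text above; the K-factor of 32, the round-robin schedule, and the `toy_quality` judge are assumptions for illustration, whereas in Co-Scientist the outcome of each comparison comes from an LLM debate.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(r_a: float, r_b: float, a_wins: bool, k: float = 32.0) -> Tuple[float, float]:
    """Return updated ratings after one pairwise comparison (K=32 is an assumption)."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_wins else 0.0
    new_a = r_a + k * (s_a - e_a)
    new_b = r_b + k * ((1.0 - s_a) - (1.0 - e_a))
    return new_a, new_b


def run_tournament(
    hypotheses: List[str],
    judge: Callable[[str, str], bool],
    initial: float = 1200.0,
) -> Dict[str, float]:
    """Round-robin tournament; judge(a, b) is True when hypothesis a wins."""
    ratings = {h: initial for h in hypotheses}
    for a, b in combinations(hypotheses, 2):
        ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], judge(a, b))
    return ratings


# Hypothetical hidden quality used by the toy judge.
toy_quality = {"H1": 0.9, "H2": 0.5, "H3": 0.1}
ratings = run_tournament(list(toy_quality), lambda a, b: toy_quality[a] > toy_quality[b])
```

Because every update moves the two ratings by equal and opposite amounts, the ratings only rank hypotheses against each other within one tournament, which is the sense in which Elo is a self-referential metric.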

The numbers are impressive, but caveats come with them. On GPQA-diamond, selecting the highest-Elo hypothesis as top-1 gave 78.4% accuracy ^[7]. On 11 expert-curated biomedical goals, the system ranked first with an expert preference rank of 2.36 (lower is better; Gemini Pro 2.45, o1 2.45, Gemini Flash Thinking 2.73), with Originality 3.64/5 and Impact 3.09/5. Then the wet-lab results. In Acute Myeloid Leukemia (AML): Binimetinib IC50 7 nM (MOLM-13), and KIRA6 IC50 13 nM (KG-1) / 517 nM (MOLM-13). In liver fibrosis: three new epigenetic targets, with all four drugs reducing fibroblast activation (p < 0.001 to p < 0.0001), two showing no cytotoxicity, and one being an FDA-approved drug. In AMR (antimicrobial resistance): the cf-PICI (chromosomal-island phage-tail) mechanism was re-derived in silico in two days, a result that took human researchers about ten years ^[15].

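But what Gottweis et al. state plainly in their limitations section must be read alongside. The system uses open-access literature only, with no access to paywalled or negative results. Its figure and graph reasoning is weak. Elo is a self-referential metric. And most important: in vitro success ≠ clinical success. Pharmacokinetics, drug interactions, and clinical-trial design are out of scope. The paper itself makes clear that the popular-press phrase "ten years in two days" applies only to the in silico re-derivation.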

Co-Scientist shows two architectural commitments that differ from v1. First, it explicitly adopts scientist-in-the-loop ^[6]. The user is defined not as a passive consumer but as a PI directing the co-scientist. This corresponds to the co-reasoning mode in this survey's human-in-the-loop disambiguation (approval / co-reasoning / correction), revisited in Chapter 8. Second, it includes wet-lab validation as part of the system's promise: the independent wet-lab replication by Guan et al. ^[8], which appeared a year later, added evidence that Co-Scientist hypotheses survive third-party replication (Chapter 9).

7.6 HKUDS AI-Researcher: Hierarchical Decomposition and Two-Level Evaluation

In May 2025, the HKU Data Science (HKUDS) group released AI-Researcher, which was accepted to NeurIPS 2025 ^[9]. Unlike Co-Scientist, which specializes in biomedicine, AI-Researcher takes ML/CS research as its substrate. Its structure is hierarchical. A Resource Analyst handles concept decomposition, mapping abstract concepts to concrete implementations; a Documentation Agent performs hierarchical synthesis; and code-generation and evaluation agents follow. All execution happens in a containerized workspace, which is intended to provide safety and reproducibility.

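The most interesting architectural choice is the hierarchical evaluation framework. Evaluation has two levels: full-spec (a human supplies the complete research idea) versus sketch (a human supplies only a sketch of the idea). An Evaluator Agent scores both levels. This is the first attempt to measure a single system along the axis of "how much it depends on human idea provision," and it connects directly to the L3-L4 boundary in this survey's 6-Level Maturity Model (Chapter 3). The limits are equally clear: the Evaluator Agent is itself an LLM, so there is a risk of self-grading bias, and because other systems have not adopted the evaluation criteria, they do not function as a shared benchmark. The production version is available at novix.science/chat.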

7.7 Zhang's Deep Researcher Agent and the Agentic Researcher's Five-Level Taxonomy

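In March and April 2026 two further systems appear. First, Zhang Xiangyue of the University of Tokyo posted Deep Researcher Agent (DRA) to arXiv ^[28]. It runs a four-stage cycle: Think (analyze previous results → hypothesis → experiment design), Execute (implement code → mandatory dry-run → launch GPU training), Monitor (zero-LLM-cost OS-level process checks, the key cost innovation), and Reflect (parse logs → evaluate metrics → choose the next action). DRA's architectural innovation is Zero-Cost Monitoring. Training accounts for more than 90% of wall-clock time, and during that time the system makes no LLM API calls and monitors using OS-level signals only. As a result, cost falls to about $0.08 per 24-hour cycle ^[28]. More than 500 cycles have been demonstrated.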
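The idea behind monitoring without LLM calls can be sketched with the Python standard library. This is an illustration of the general approach, not DRA's actual mechanism, which the text above does not detail: the short sleeping subprocess stands in for a GPU training job, and the polling interval and returned fields are assumptions.

```python
import subprocess
import sys
import time
from typing import Dict


def wait_for_training(proc: subprocess.Popen, poll_seconds: float = 0.05) -> Dict[str, int]:
    """Block until the job exits, checking only OS-level process state."""
    checks = 0
    llm_calls = 0  # never incremented: no model is consulted while the job runs
    while proc.poll() is None:  # poll() asks the OS whether the process has exited
        checks += 1
        time.sleep(poll_seconds)
    return {"exit_code": proc.returncode, "checks": checks, "llm_calls": llm_calls}


# Stand-in for a long training run (hypothetical; a real job would be a training script).
job = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(0.2)"])
status = wait_for_training(job)

# Only after the process exits would the Reflect stage spend LLM tokens on the logs.
```

The design choice is that the expensive component (the LLM) is invoked only at the Think and Reflect boundaries, while the long Monitor phase costs nothing beyond local polling.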

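What DRA demonstrates is the feasibility of a budget-constrained AI Scientist. Where AAR spent $18,000 (Chapter 8 §8.4) and Sakana v1 spent $15 per paper, DRA brought a 24-hour cycle down to $0.08. Cost determines how many times the same cycle can be run, and the number of cycles affects the statistical reliability of the results. DRA's significance is that the assumption "AI Scientists are expensive" can be broken by an architectural choice.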

Second, the Agentic Researcher group published a five-level autonomy taxonomy in arXiv:2603.15914 ^[2]. It comprises a classification from Level 0 (full human control) to Level 4 (high agent autonomy) and a set of "commandments": methodological rules baked into the agent prompt (for example, "every claim has a source ID" and "experiments must dry-run before launch"). It was published together with an open-source sandboxed framework. The taxonomy assumes Claude Code, Codex, and OpenCode as its substrate. This survey's 6-Level Maturity Model (Chapter 3) differs in having six levels rather than five, but the framing aligns directly: what our ladder adds is the L5/L6 split (dry-lab vs wet-lab) and L0 (a one-shot summarization baseline) at the bottom of the ladder.

7.8 Common Patterns: Not One Giant Prompt

Figure 7.3: Closed-loop AI Scientist, a five-stage closed loop of hypothesis · experiment · run · analyze · write. Illustration by author (gpt-image assisted)

Placing the systems covered in 7.2–7.7 (Sakana v1/v2, Co-Scientist, AI-Researcher, DRA, plus AAR and Karpathy's autoresearch, both detailed in Chapter 8) on architectural axes makes their commonalities stand out. All of them combine multi-agent design (7 agent types in Co-Scientist, 9 instances in AAR, 4 specialized agents in AI-Researcher), tool use (descendants of Toolformer/HuggingGPT, MCP-friendly), sandboxed execution (containerized workspace, dry-run gate), memory (descendants of verbal RL/Reflexion; file-based vs forum-based), evaluation (simulated reviewer / Elo / hierarchical eval / PGR), and reflection (per-cycle post-mortem). In addition, the reviewer agent is factored out as a separate component: Sakana v1's simulated reviewer, Co-Scientist's five Reflection roles, and AI-Researcher's Evaluator Agent.

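This is what the claim in ChatGPT seed §4.1, that "an AI Scientist is not one giant prompt," means ^[22]. What produces closed-loop autonomy is not the size of a single model but structural decomposition. The architectural choices of splitting one task across multiple agents, attaching tools to each agent, and inserting memory and reviewers between them determine most of a system's performance and reliability.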

This survey summarizes the common pattern as follows.

Loop stage	Responsible agent (typical)	Tools and memory
Literature ingestion	literature agent (PaperQA2-style)	RAG + citation graph
Research gap extraction	reviewer agent + reflection LLM	wiki claims/contradictions
Hypothesis generation	generation agent + debate agent	episodic memory
Experiment design	planner agent + statistician agent	DOE template, KB
Code/protocol generation	coding agent (Aider/Codex/Claude Code)	sandboxed repo
Execution	execution agent + monitor	container, dry-run gate
Result analysis	analyzer agent	logs, metrics DB
Reflection	reflection LLM	episodic memory
Next experiment	planner agent	priority queue, Elo
Paper/report writing	writer agent + reviewer	LaTeX, simulated review

This table does not invent a new architectural primitive; its parts already existed in the Princeton group's ReAct/Reflexion/ToT, in Toolformer/HuggingGPT, and in ChemCrow/Coscientist. This survey's contribution is to name these parts as a single pattern. The same combination of parts appears in ML (autoresearch), alignment (AAR), biomedicine (Co-Scientist), clinical research (Medical AI Scientist), and materials (SciAgents) (Chapter 9), which is why we can speak of "AI Scientist" as a layer (L4 in the 4-layer taxonomy of Chapter 2).

7.9 Evaluation Lags Execution: Metrics Are Not Yet Shared

The last thing this chapter must address is the problem of comparison metrics. The metrics reported by the systems covered in 7.2–7.7 are as follows. Sakana v1: simulated reviewer score, cost in dollars per paper, novelty assessment. v2: ICLR workshop reviewer scores, one-in-three acceptance rate. Co-Scientist: GPQA-diamond top-1, expert preference rank, Elo, wet-lab IC50, p-values, in silico re-derivation time. AI-Researcher: hierarchical eval (full-spec vs sketch). AAR: PGR (Performance Gap Recovered), cost in dollars per hour. autoresearch: val_bpb, Time-to-GPT-2, keep-worthy improvement count. DRA: dollars per 24-hour cycle, cycle count.

No metric is shared across systems. GPQA measures graduate-level multiple-choice performance, not hypothesis quality. PGR is specific to weak-to-strong supervision. The ICLR workshop acceptance is n=1 and venue-specific. The simulated reviewer is an LLM from the system's own family. Elo is self-referential.

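This corresponds to G8 in this survey (the gap stated in Chapter 3). The absence of a shared benchmark makes it hard to give a quantitative answer to the question "how far have AI Scientists come?" When the popular press lines up "Sakana vs Co-Scientist vs AAR," it is putting metric-incomparable axes into one picture. That is why this chapter groups the systems as a genealogy rather than as a benchmark ranking. Schmidgall et al. evaluated the reliability of novelty assessment for one system (v1) ^[18]. The generalization of that work, a common benchmark for the whole AI Scientist family, has not yet appeared within this survey's horizon (2026-05). That absence is itself one of the findings.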

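When a field matures, two things happen. First, shared metrics emerge. Second, the popular press's comparative headlines are tested against those metrics. The roughly 21 months of progress seen in this chapter mark the beginning of the first, not the completion of the second. In the next chapter (Chapter 8) we look at how the same pattern applies to narrower tasks: Karpathy's autoresearch for ML training optimization, Anthropic's AAR for alignment, and Stanford's Paper2Agent for paper → MCP conversion. We also treat G3 (AAR's failed transfer to production Sonnet-4) at paragraph level, because it is the most recent instance of the "honest negative result publishing" discipline introduced in 7.4.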

References

Anthropic (2026). Automated Alignment Researchers — Using LLMs to scale scalable oversight. Anthropic Research, 2026-04-14.
Agentic Researcher (2026). The Agentic Researcher: A Practical Guide to AI-Assisted Research. arXiv:2603.15914.
Boiko, D. A., MacKnight, R., & Gomes, G. (2023). Emergent autonomous scientific research capabilities of large language models. arXiv:2304.05332; Nature 2023. DOI:10.1038/s41586-023-06792-0.
Bran, A. M., Cox, S., Schilter, O., Baldassari, C., White, A. D., & Schwaller, P. (2023). ChemCrow: Augmenting large-language models with chemistry tools. arXiv:2304.05376; Nature Machine Intelligence 2024.
Chen, W. et al. (2023). AgentVerse: Facilitating Multi-Agent Collaboration. arXiv:2308.10848.
Google AI (2025). Accelerating scientific breakthroughs with an AI co-scientist. Google Research blog, 2025-02-19.
Gottweis, J. et al. (2025). Towards an AI co-scientist. arXiv:2502.18864.
Guan et al. (2026). AI-Assisted Drug Re-Purposing for Human Liver Fibrosis. Advanced Science.
HKUDS (2025). AI-Researcher: Autonomous Scientific Innovation. arXiv:2505.18705; NeurIPS 2025.
Hong, S. et al. (2023). MetaGPT: Meta Programming for Multi-Agent Collaborative Framework. arXiv:2308.00352.
Karpathy, A. (2026a). karpathy/autoresearch. GitHub.
Li, G. et al. (2023). CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society. arXiv:2303.17760.
Lu, C., Lu, C., Lange, R. T., Foerster, J., Clune, J., & Ha, D. (2024). The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery. arXiv:2408.06292.
Nature News (2026). How to build an AI scientist: first peer-reviewed paper spills the secrets. Nature.
PsyPost (2026). Google's AI co-scientist just solved a biological mystery that took humans a decade. PsyPost.
Sakana AI (2025). SakanaAI/AI-Scientist-ICLR2025-Workshop-Experiment — Code release. GitHub.
Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., & Scialom, T. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. arXiv:2302.04761; NeurIPS 2023.
Schmidgall et al. (2025). Evaluating Sakana's AI Scientist for Autonomous Research: Wishful Thinking or an Emerging Reality?. arXiv:2502.14297; SIGIR Forum. DOI:10.1145/3769733.3769747.
Shen, Y., Song, K., Tan, X., Li, D., Lu, W., & Zhuang, Y. (2023). HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face. arXiv:2303.17580; NeurIPS 2023.
Shinn, N., Cassano, F., Berman, E., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv:2303.11366; NeurIPS 2023.
Sun, J. et al. (2023). Think-on-Graph: Deep and Responsible Reasoning of Large Language Model on Knowledge Graph. arXiv:2307.07697; ICLR 2024.
Um, T. (2026a). AI Co-Scientist 요약 + 분석 [AI Co-Scientist summary and analysis]. terryum.ai paper post.
Wu, Q. et al. (2023). AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. arXiv:2308.08155.
Yamada, Y. et al. (2025). The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search. arXiv:2504.08066.
Yang, J. (2023). Auto-GPT: An Autonomous GPT-4 Experiment. GitHub.
Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., & Narasimhan, K. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv:2305.10601; NeurIPS 2023.
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629; ICLR 2023.
Zhang, X. (2026). Deep Researcher Agent: An Autonomous Framework for 24/7 Deep Learning Experimentation with Zero-Cost Monitoring. arXiv:2604.05854.