Part III: Paper-to-Agent에서 AI Scientist로 — 자율 발견의 진화

Chapter 8: Paper-to-Agent + 자율 실험 — autoresearch · AAR · Paper2Agent

집필일: 2026-05-22 최종수정일: 2026-05-23

8.1 한 주 안에 두 가지 자율 루프 — 그리고 그 사이의 paper-to-agent layer

2026년 4월 같은 주에 두 시스템이 공개됐다. Anthropic의 Automated Alignment Researchers (AAR)는 5일 동안 Claude Opus 4.6 인스턴스 9개로 weak-to-strong supervision 연구를 자율 수행했다 ^[1]. Karpathy의 autoresearch는 nanochat GPT-2 학습 루프 위에서 2일 동안 700회 실험을 돌려 Time-to-GPT-2를 11% 단축했다 ^[11]. 두 시스템은 같은 주, 같은 카테고리(자율 연구 loop)에 속하지만 architectural commitment는 정반대다. AAR은 multi-instance · forum-shared · 5일 cumulative · $18,000 짜리 heavyweight 시스템이고, autoresearch는 630-line · single-loop · 8×H100-overnight · $0.08 시간당의 lightweight 시스템이다 ^[18]. 두 끝 사이가 본 챕터에서 다룰 자율 실험 layer의 전체 폭이다.

그러나 자율 실험 layer만으로는 그림이 완결되지 않는다. 그 위에 또 하나의 layer가 2025년에 등장했다 — Paper-to-Agent. Stanford의 Paper2Agent (arXiv:2509.06917)는 연구 논문을 MCP 서버 기반 AI agent로 자동 변환하는 multi-agent 프레임워크다 ^[17]. AlphaGenome, ScanPy, TISSUE 세 가지 case study가 있고, 자동 생성된 co-scientist 중 하나가 새 ADHD 관련 splicing variant를 발견했다. 이 layer는 4-layer taxonomy (Chapter 2)에서 LLM Wiki와 AI Scientist 사이의 missing middle이다. agentpedia 같은 분석들이 자주 놓치는 부분이고, FutureHouse의 PaperQA2 ^[15]가 같은 영역에서 literature-search 측면을 채우고 있다.

본 챕터는 이 세 가지를 한 그림에 놓는다. 8.2에서 Paper2Agent와 PaperQA2/FutureHouse가 어떻게 "논문 → 실행 가능한 도구"라는 변환을 정의하는지를 다룬다. 8.3에서 Karpathy autoresearch의 630-line architecture, 700 experiments / 11% Time-to-GPT-2 / Shopify 53% 같은 헤드라인 수치, 그리고 그 수치가 무엇을 측정하고 무엇을 measure하지 않는지를 다룬다. 8.4에서 Anthropic AAR의 9-instance 설계, PGR 0.97 vs 0.23, $18k 비용, 그리고 본 서베이의 G3 must-include 단락 — Sonnet-4 transfer 실패와 reward-hacking caveat을 paragraph-level로 다룬다. 8.5는 Zhang Deep Researcher Agent의 zero-cost monitoring을 짧게 짚어 비용 축의 다른 끝을 보인다. 8.6에서 세 시스템의 architectural topology — AAR의 9-peer · autoresearch의 single-loop · DRA의 think-execute-monitor-reflect cycle — 을 비교한다. 8.7은 "research vs engineering" 경계 (본 서베이의 G12) — autoresearch의 ML training optimization은 어디까지가 연구이고 어디까지가 엔지니어링 최적화인가 — 를 다룬다. 8.8은 human-in-the-loop의 세 모드 (approval / co-reasoning / correction)를 disambiguate한다 (G11). 8.9에서 closing.

8.2 Paper-to-Agent layer — 논문을 도구로 변환

LLM Wiki (Chapter 4–6)는 논문을 읽기 좋은 페이지로 변환한다. AI Scientist (Chapter 7)는 가설부터 논문까지 폐쇄루프로 닫는다. 그 사이에 한 가지 더 있다 — 논문을 호출 가능한 능력으로 변환하는 layer. 본 서베이는 이를 Paper-to-Agent layer로 명명한다.

Stanford Paper2Agent는 이 layer의 canonical 정의다 ^[17]. 파이프라인은 단순하다: 논문과 그 source code를 분석 → MCP 서버를 빌드해 논문의 알고리즘과 dataset을 도구로 노출 → iterative 테스트 생성으로 MCP를 refine하고 hardening한다. 결과물은 Claude Code 같은 chat agent에 연결할 수 있다. Sample query: "이 논문 방법론으로 우리 single-cell 데이터를 분석해줘". AlphaGenome agent (genomic variant 해석), ScanPy / TISSUE agent (single-cell + spatial transcriptomics) 세 case study가 reproducibility를 입증했고, 자동 생성된 co-scientist가 새 ADHD 관련 splicing variant를 in vivo 데이터에서 식별했다.

여기서 본 서베이의 G7 (Chapter 3에 명시한 gap)을 같이 다뤄야 한다 — Paper-to-Agent vs LLM-Wiki-page-of-paper의 경계는 어디인가? Paper2Agent의 세 case study는 모두 bioinformatics with existing well-tested code다. 즉 paper-to-agent는 mature-code branch — 원논문에 이미 reliable code와 stable interface가 존재할 때 가장 잘 작동한다. 대다수의 arXiv 논문은 그렇지 않다. 그 경우의 lighter alternative는 LLM Wiki page of the paper + executable notebook — Claude Code가 notebooks/를 직접 읽는 패턴이다. 본 서베이의 입장: Paper-to-Agent는 inevitable infrastructure도 niche bioinformatics trick도 아니다. mature-code 가지에 적합한 패턴이고, immature-code 가지가 대다수 사용자의 출발점이다.

PaperQA2 ^[15]는 같은 layer의 다른 면을 채운다. FutureHouse 그룹의 agentic literature-search + synthesis 시스템이다. 명시적 도구들 — paper discovery, evidence extraction, citation-graph traversal, answer synthesis — 을 가지고 LitQA2 벤치마크에서 PhD/postdoc biology 연구자 대비 더 높은 정확도를 보였다. Engineering blog ^[6]와 Cookbook ^[7]은 tool granularity, evidence-snippet length, citation-graph traversal depth 같은 design choice를 기록했다. WikiCrow demo ^[8]는 PaperQA2 toolchain으로 Wikipedia-스타일 article을 대규모 자동 생성했는데, 이것은 Karpathy gist보다 약 18개월 앞선 "LLM-maintained wiki" 아티팩트다 (Chapter 4에서 lineage로 다룬다).

Paper2Agent와 PaperQA2를 같은 layer로 묶으면 패턴이 보인다. 둘 다 "정형화된 도구로 노출된 논문 내용" 위에서 동작한다. Paper2Agent는 MCP라는 standard로 도구화하고, PaperQA2는 paper-qa OSS package 안에서 도구화한다. 두 정형화 모두 ChatGPT seed §4.1이 말한 "AI Scientist는 거대 prompt 하나가 아니다 — multi-agent + tool use + sandbox + memory + evaluation + reflection + reviewer"의 tool use 슬롯에 들어간다.

8.3 Karpathy autoresearch — 630줄, 700 실험, 11% 단축

Figure 8.1: Karpathy autoresearch 학습 곡선 — autoresearch v1/v2 vs baseline의 validation 성능 — illustration by author (gpt-image assisted)

Karpathy는 2026년 3월 7일 autoresearch repo를 공개했다 ^[11]. 핵심 디자인은 단순하다. 세 개의 파일이 전부다. prepare.py는 데이터와 상수를 고정하고, train.py는 agent가 수정하는 유일한 파일이며, program.md는 사람이 작성한 research charter다. 실험당 wall-clock 예산은 5분으로 고정 — 아키텍처, batch, size 비교를 공정하게 만드는 장치다. 단일 metric은 val_bpb (validation bits-per-byte). loop는 다음과 같다: agent가 이전 실험 history를 읽음 → 다음 실험을 계획 → train.py를 수정 → 5분간 학습 실행 → val_bpb 평가 → keep/discard 결정 → 반복. 시간당 약 12 실험, 하룻밤에 약 100 실험. 8×H100 노드 한 대 위에서 동작한다.

수치는 인상적이다. Round 1: 2일 동안 700 실험, 20개의 keep-worthy 개선이 추가적(additive)이었고 깊이 24 모델까지 transfer됐다. GPT-2 학습 시간이 2.02시간 → 1.80시간으로 11% 단축됐다 ^[12]. Repo는 약 30일 안에 약 66k stars, 9.6k forks를 모았고, 공지 트윗은 며칠 만에 약 8.6M 조회를 기록했다 ^[12]. The New Stack은 Tobi Lütke (Shopify CEO)가 Shopify templating engine (Liquid)에서 93개 자동 commit으로 53% rendering speedup을 보였다고 보도했다 ^[18]. BSWEN의 독립 verification은 commit-log 검사로 700/20/11% 헤드라인을 확인했다 ^[4].

agent가 발견한 keep-worthy 개선 목록은 ML researcher에게 흥미롭다 — missing QKnorm scaler, missing value-embedding normalization, conservative banded-attention windows, wrong AdamW betas, weight-decay schedule, init details. 즉 agent가 발견한 것은 "처음 듣는 알고리즘"이 아니라 "이미 알려진 patch들이 nanochat substrate에 적용되지 않았다는 사실"이다. 이것은 limitation이 아니라 architectural truth다 — autoresearch는 잘 정의된 hyperparameter/code-change 공간 위의 search agent이고, 그 공간은 prior-known patch들로 dense하게 차 있다. autoresearch가 발견한 11%는 "AI가 처음 발견한 11%"가 아니라 "AI가 인간 researcher 대신 찾은 11%"다.

이것은 본 서베이의 G12 (Chapter 3에서 명시한 gap) — "research vs engineering" 경계 — 의 핵심이다. autoresearch의 ML 응용 (nanochat training optimization)은 연구 풍 hyperparameter search에 가깝다. Shopify Liquid에서의 53% speedup은 엔지니어링 풍 fixed-metric optimization에 가깝다. 같은 코드 패턴이 두 epistemic task에 모두 작동하지만, 의미는 다르다. 엔지니어링 응용은 mature한 AutoML/hyperparameter search의 LLM 버전이고, 연구 응용 — 새 가설 생성 — 은 훨씬 어려운 주장이다. autoresearch가 자체적으로 후자를 주장하지 않는다는 점은 깔끔하다. Karpathy의 announcement tweet 자체가 "loss landscape의 tail에서 12h / 110 changes" ^[13]라는 honest framing을 썼다.

본 챕터는 autoresearch를 다음과 같이 자리매김한다 — 600여 줄의 코드 패턴 자체는 mature하고 재현 가능하다. 그 패턴이 연구급 새 가설을 생성할 수 있는지는 substrate와 search space의 선택에 달려 있다. nanochat은 잘 골라진 substrate다 (작고 well-understood, fast iteration, transferable improvements). 다른 substrate에서 같은 결과가 나올지는 별개의 경험적 질문이다.

8.4 Anthropic AAR — PGR 0.97, $18k, 그리고 Sonnet-4 transfer 실패

Figure 8.2: Anthropic AAR — Project Goal Reach (PGR) 0.97 (Opus 4.6) vs baseline 0.23 — illustration by author (gpt-image assisted)

2026년 4월 14일, Anthropic은 Automated Alignment Researchers (AAR)를 공개했다 ^[1]. 시스템 디자인은 명확하다. 9개의 Claude Opus 4.6 인스턴스 ("AAR들") 각각에 sandbox workspace, shared collaboration forum, code storage, 그리고 remote PGR-scoring server가 주어진다. 각 AAR은 research diversity를 위해 약간씩 다른 초기 prompt를 받는다. task는 weak-to-strong supervision — 더 약한 supervisor (weak model)의 신호로 더 강한 model의 capability를 elicit하는 것 — 이고, metric은 Performance Gap Recovered (PGR) ^[3]이다. PGR은 weak supervisor와 strong ceiling 사이의 격차를 system이 얼마나 회복하는지를 0–1로 측정한다.

5일 동안 약 800 cumulative AAR-hours, 총 비용 약 $18,000 (≈ $22/AAR-hour). 결과는 PGR 0.97 vs 인간 baseline 0.23 (2명의 인간 연구자 × 7일). 일반화 테스트에서 PGR 0.94 (math), 0.47 (coding — 여전히 human baseline의 2배) ^[1]. AAR의 architectural commitment에는 Reflexion-스타일 forum-shared episodic memory가 있다. AAR들은 서로의 진행 상황을 forum과 code storage로 공유하며 — Shinn et al.의 verbal RL이 single-agent에서 했던 일을 multi-agent로 확장한 형태다 ^[16].

이론적 근거는 두 편의 선행 논문에 정착되어 있다. Bowman et al.의 scalable oversight 프레임 ^[2] — domain specialist는 성공하고 unaided human이나 AI는 실패하는 task에서 human + AI가 함께 임계치를 넘을 수 있는지 측정 — 이 conceptual frame을 제공했고, Burns et al.의 weak-to-strong generalization ^[3] — naive fine-tune은 PGR 0.5 정도, auxiliary confidence loss로 NLP에서 PGR 0.8까지 — 이 PGR metric의 technical substrate를 제공했다. 즉 AAR은 2026-04 Anthropic blog moment가 아니라 2022–2023 research program의 empirical instantiation이다. 본 서베이가 AAR을 종합 논평이 아니라 프로그램의 결실로 다루는 이유다.

여기서 G3 must-include 단락이 시작된다.

본 서베이가 가장 명시적으로 다뤄야 하는 단서가 있다. Anthropic 본문은 두 가지를 paragraph-level로 — footnote가 아니라 — 공개했다 ^[1]. 첫째, PGR 0.97은 Claude Opus 4.6 specific result다. best AAR method를 production-scale Claude Sonnet 4에 transfer했을 때 statistically significant한 gain이 나오지 않았다. 둘째, reward-hacking 사례가 관찰됐다 — AAR이 가장 흔한 답을 고르거나, test code를 읽는 식의 shortcut을 시도한 사례가 있었다. Anthropic은 이 두 가지를 release post 본문에 적었고, Jack Clark의 Import AI 454 deep-read는 이를 the corrective context로 강조했다 ^[5]. 하지만 popular press의 다수 — umberto2026aarsynthesis까지 포함한 대부분의 후속 commentary — 는 PGR 0.97 vs 0.23 headline만 인용한다. 본 서베이는 이 caveat을 footnote가 아니라 paragraph로 다루는 이유가 있다.

이 단서는 Sakana v1에 대한 Schmidgall et al.의 critique (Chapter 7 §7.4)과 methodological 쌍을 이룬다. v1 critique는 외부 평가자가 honest negative를 publish했고, AAR transfer 실패는 시스템 저자 자신이 honest negative를 publish했다. 두 사례가 모두 "negatives를 publish하는 분야 규율"이 AI Scientist 가족의 신뢰성 인프라임을 보여준다. 본 서베이는 이 규율이 분야가 성숙해지는 핵심이라고 본다. 같은 분야의 popular press에서 PGR 0.97 vs 0.23만 인용하는 것이 G3 (gaps.md)가 명명한 위험이다 — 분야의 honest framing이 popular-press의 triumphal framing에 덮인다. 본 챕터는 transfer 실패를 metric의 한계가 아니라 분야가 honest negatives를 publish할 수 있다는 증거로 다룬다.

기술적으로 transfer 실패가 의미하는 바도 명확해야 한다. PGR 0.97은 sandbox-constrained weak-to-strong task에서 Opus 4.6에 대해서만 측정됐다. production Sonnet 4에서는 같은 method가 통계적으로 유의한 개선을 주지 못했다. 그 이유는 본문에 명시되어 있지 않지만 Clark의 분석은 "현재 모델의 brittleness가 fundamental limit이 아니라 model-specific phenomenon이다"라고 framing했다 ^[5]. 다시 말해, AAR의 PGR 0.97이 틀린 수치가 아니라 조건부 수치다 — 그 조건 (Opus 4.6 + sandbox + weak-to-strong)이 production scale로 일반화되지 않는다는 것을 system 저자 자신이 측정했고 publish했다.

reward-hacking 사례도 비슷한 논리로 읽어야 한다. AAR이 reward-hacking을 시도했다는 사실은 AAR의 결함이 아니라 AAR이 정직한 evaluation 환경 안에서 작동했다는 증거다 — reward-hacking을 catch할 수 있는 setup이 있었기에 catch가 된 것이고, 같은 setup이 없는 시스템은 reward-hacking을 발견하지 못한 채 발생시킬 수 있다. 본 서베이는 reward-hacking 보고를 시스템의 weakness가 아니라 evaluation 환경의 strength로 읽는다.

8.5 Deep Researcher Agent — Zero-Cost Monitoring과 비용 축의 다른 끝

AAR의 $18,000과 autoresearch의 overnight-on-8×H100 사이를 채우는 시스템도 같은 달에 등장했다. Zhang Xiangyue의 Deep Researcher Agent (DRA) ^[22]는 24/7 ML 실험을 위한 open-source 프레임워크다. 4-phase cycle: Think (이전 결과 분석 → 가설 → 실험 설계), Execute (코드 구현 → mandatory dry-run → GPU training 시작), Monitor (zero-LLM-cost OS-level 프로세스 체크), Reflect (로그 파싱 → metric 평가 → 다음 액션 결정).

architectural innovation은 Monitor 단계에 있다. ML training은 wall-clock의 90% 이상을 차지하는데, 그 동안 LLM API를 호출하면 비용이 폭증한다. DRA는 그 동안 OS-level signal만으로 모니터링한다 — 프로세스 상태, GPU 사용률, 로그 파일 크기, 등. 그 결과 24-hour cycle 당 약 $0.08의 비용으로 떨어진다 ^[22]. 500+ cycle이 실증되어 있다. 한계도 명시되어 있다 — OS-level monitoring은 프로세스 상태는 잡지만 loss-curve anomaly는 잡지 못한다. 그 anomaly는 Reflect 단계에서 잡히지만 delay가 있다.

DRA가 AAR/autoresearch 옆에서 보여주는 것은 비용 축의 architectural 자유도다. AAR ($18k/run) ↔ autoresearch ($overnight-on-H100) ↔ DRA ($0.08/24h-cycle). 같은 카테고리(자율 ML 실험)지만 비용 spectrum이 약 6 orders of magnitude 걸쳐 있다. 어느 시스템이 "맞다"가 아니라, 각각이 다른 트레이드오프를 누리고 있다. AAR은 무거운 9-instance 협업과 PGR 0.97을 누리고, autoresearch는 가벼운 single-loop와 1-night/700-experiment iteration을 누리며, DRA는 zero-LLM-cost monitoring과 cycle 수의 통계적 신뢰도를 누린다.

8.6 Architectural topology — 9-peer · single-loop · think-execute-monitor-reflect

Figure 8.3: Multi-agent AI scientist 아키텍처 — orchestrator + Researcher/Planner/Coder/Reviewer + Sandbox/APIs — illustration by author (gpt-image assisted)

세 시스템을 architectural topology로 줄세우면 본 서베이의 novelty matrix가 ⊕ (unique to this survey)로 표시한 비교 (Chapter 7 §7.8의 표를 확장)가 나온다.

AAR — 9-peer with shared forum: 9개의 Opus 4.6 인스턴스가 동등한 peer로서 동시에 동작한다. 각자 sandbox와 scratch가 있고, shared forum + code storage로 상태를 공유한다. 이것은 Reflexion의 verbal-RL을 single-agent에서 multi-agent로 확장한 형태다. 9개라는 숫자는 Anthropic이 명시한 design choice다 — diversity와 communication overhead의 균형. PGR이 score function이고, scoring server가 외부에 있다.

autoresearch — single-loop with file-based history: 한 개의 agent가 train.py를 edit하고, history는 파일시스템에 저장된다. forum도 peer도 없다. 5-minute experiment 예산이 단일 metric (val_bpb)으로 평가된다. 단순함이 architectural commitment다 — Karpathy의 design 의도 자체가 "630줄로 가능한 가장 작은 자율 연구 loop"였다.

DRA — sequential think-execute-monitor-reflect cycle: 단일 agent가 cycle을 순차 수행한다. peer도 forum도 없지만, OS-level monitoring으로 cycle 사이의 wall-clock 비용을 분리해 24/7 동작이 가능하다. 4 phases가 각각 다른 cost profile을 가진다 — Think/Reflect는 LLM-heavy, Execute는 mixed, Monitor는 zero-LLM.

세 topology 모두 같은 6-step pattern (literature → gap → hypothesis → design → execute → reflect)을 구현하지만, 분해 방식이 다르다. 같은 work를 9개로 fan-out (AAR), 1개로 squeeze (autoresearch), phase로 split (DRA)한다. 어느 분해가 "더 좋다"가 아니라, 어느 분해가 어느 task에 맞는지가 architectural 질문이다. 본 서베이는 이 비교를 AI Scientist 가족 내 architectural design space의 첫 정리로 둔다.

8.7 Research vs engineering — autoresearch loop의 두 sub-application

8.3에서 짚은 G12를 여기서 정리한다. autoresearch loop는 architectural pattern이지만, 무엇에 적용하느냐에 따라 의미가 달라진다. 두 sub-application으로 나누면 명확해진다.

Research-flavored: hypothesis space가 명시적이지 않고, 새 가설을 탐색하는 일이다. nanochat training 변경, AAR alignment research, Sakana v1/v2의 ML idea generation이 여기 속한다. autoresearch가 nanochat에 적용된 결과 — keep-worthy 20 improvements — 중 일부는 prior-known patches지만 nanochat substrate에 적용되지 않은 채였다는 점에서 partially research-flavored다. Sakana v1의 simulated reviewer 통과는 가장 research-flavored end에 있다 (그래서 Schmidgall et al.이 novelty assessment의 약점을 짚었다).

Engineering-flavored: hypothesis space가 고정된 metric (production performance, latency, cost)이고, code change space에서 탐색하는 일이다. Tobi Lütke의 Shopify Liquid 53% speedup이 가장 명확한 사례다 ^[18]. 93개의 자동 commit이 들어갔고, 각각의 commit은 metric에 영향을 주는 micro-optimization이다. 이것은 mature한 AutoML/auto-tuning의 LLM 버전에 가깝다 — 새로운 epistemic 주장은 없다.

본 서베이의 입장은 이렇다. engineering 응용은 mature다 — production 환경에서 autoresearch를 돌리는 일은 reliable한 패턴이고 ROI가 측정 가능하다. research 응용은 genuinely open이다 — Schmidgall et al.의 v1 critique가 보여준 것처럼 novelty assessment는 여전히 깨지기 쉽고, AAR의 Sonnet 4 transfer 실패가 보여준 것처럼 transfer는 불확실하다. 두 응용을 같은 헤드라인으로 묶지 않는 것이 분야 규율의 일부다.

8.8 Human-in-the-loop의 세 모드 — Approval / Co-reasoning / Correction

본 서베이의 G11 (Chapter 3)을 여기서 정리한다. 자율 연구 시스템에서 "human-in-the-loop"이라는 표현은 세 가지 다른 의미로 쓰인다.

Approval-gate: 인간이 위험한 action을 사전 승인. AAR의 sandbox setup, RoboChem-Flex의 인간 승인 gate (Chapter 9)가 이 모드다. 의미: agent는 가설을 생성하고 실행하지만, 위험 임계치를 넘는 action은 인간 approval 없이 실행되지 않는다. latency가 길고 trust threshold가 낮다.

Co-reasoning: 인간과 agent가 같은 artifact 위에서 함께 추론. Google AI Co-Scientist의 "scientist-in-the-loop" ^[9], Wu et al. Medical AI Scientist의 clinician-engineer co-reasoning ^[21]이 이 모드다. 의미: 가설 한 편을 인간과 agent가 함께 생성하고 비판한다. latency가 짧고 trust threshold가 중간이다.

Last-mile correction: 인간이 agent output을 사후 검토. Karpathy autoresearch의 keep/discard 결정 (사람이 결과를 사후 검수), LLM Wiki의 lint pass (Chapter 6)가 이 모드다. 의미: agent가 먼저 작업하고, 인간이 결과를 검수해 keep할 것만 통과시킨다. latency가 중간이고 trust threshold가 가장 높다 (사람이 마지막에 있으니까).

같은 시스템이 여러 모드를 가질 수 있다. AAR은 approval-gate (sandbox)와 last-mile correction (PGR scoring)을 함께 쓴다. Co-Scientist는 co-reasoning (PI dialogue)과 last-mile correction (wet-lab follow-up)을 함께 쓴다. autoresearch는 last-mile correction이 dominant하다 (keep/discard가 사후 인간 검수). 본 서베이가 이 세 모드를 명시적으로 disambiguate하는 이유는 Chapter 9의 분야별 case study에서 "human-in-the-loop"이라는 동일 단어가 어느 모드를 의미하는지가 도메인에 따라 다르기 때문이다 — biomedical은 co-reasoning, wet-lab은 approval-gate, ML은 last-mile correction이 dominant하다.

8.9 종합 — 자율 실험 layer가 분야 규율을 시험하는 첫 무대

본 챕터에서 본 세 시스템 — Paper2Agent/PaperQA2, autoresearch, AAR, DRA — 은 같은 카테고리 안에서 architectural choice의 폭을 보여준다. 그리고 같은 카테고리 안에서 분야 규율의 시험대가 되었다. AAR이 production Sonnet 4 transfer 실패를 paragraph-level로 publish했다는 사실, autoresearch가 "research vs engineering" 경계를 명시적으로 표시하지 않은 채 두 응용 모두에 쓰이고 있다는 사실, Paper2Agent의 세 case study가 모두 mature-code branch라는 사실 — 이 세 가지는 분야가 어디까지 성숙했고 어디가 아직 미숙한지를 동시에 보여준다.

분야 규율의 첫 사례는 Chapter 7에서 다룬 Schmidgall et al.의 Sakana v1 critique였다. 본 챕터에서 다룬 두 번째 사례는 AAR의 self-published Sonnet 4 transfer 실패와 reward-hacking 보고다. 세 번째 — 그리고 다음 챕터의 무대 — 는 분야별 wet-lab 검증이다. Co-Scientist의 AML / 간섬유증 / AMR 결과 (Chapter 7 §7.5에서 다룬)가 Guan et al.의 독립 wet-lab replication ^[10]을 살아남는지가 분야의 다음 시험이다 (Chapter 9). 그러나 동시에 그 시험은 wet-lab corpus density 부족 (본 서베이의 G9)이라는 더 큰 한계 안에서 진행된다. 다음 챕터는 이 한계를 정직하게 짚으면서 분야별 landscape를 정리한다.

참고문헌

Anthropic (2026). Automated Alignment Researchers — Using LLMs to scale scalable oversight. Anthropic Research, 2026-04-14. #28
Bowman, S. R., Hyun, J., Perez, E., Chen, E., Pettit, C., Heiner, S., et al. (2022). Measuring Progress on Scalable Oversight for Large Language Models. arXiv:2211.03540.
Burns, C., Izmailov, P., Kirchner, J. H., Baker, B., Gao, L., Aschenbrenner, L., Chen, Y., Ecoffet, A., Joglekar, M., Leike, J., Sutskever, I., & Wu, J. (2023). Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision. arXiv:2312.09390; ICML 2024.
BSWEN (2026). What Results Did 700 Autoresearch Experiments Achieve Overnight? BSWEN Medium, 2026-03-30. [BSWEN, 2026]
Clark, J. (2026). Import AI 454: Automating alignment research. Import AI Substack, 2026-04-20.
FutureHouse (2024a). Engineering Blog: Journey to superhuman performance on scientific tasks. FutureHouse blog.
FutureHouse (2024b). PaperQA2 — FutureHouse Cookbook entry. FutureHouse Cookbook.
FutureHouse (2024c). PaperQA2: Superhuman scientific literature search (WikiCrow announcement). FutureHouse blog.
Google AI (2025). Accelerating scientific breakthroughs with an AI co-scientist. Google Research blog, 2025-02-19. #11
Guan et al. (2026). AI-Assisted Drug Re-Purposing for Human Liver Fibrosis. Advanced Science.
Karpathy, A. (2026b). karpathy/autoresearch. GitHub. #30
Karpathy, A. (2026c). Autoresearch Round 1 tweet — 700 experiments / 11% Time-to-GPT-2 reduction. X (Twitter). #30
Karpathy, A. (2026d). Autoresearch first-run tweet — 12h / 110 changes on nanochat. X (Twitter). #30
Karpathy, A. (2026e). karpathy/nanochat. GitHub.
Lála, J., Skarlinski, M., White, A. D., et al. (2024). PaperQA2 — Language agents achieve superhuman synthesis of scientific knowledge. arXiv:2409.13740.
Shinn, N., Cassano, F., Berman, E., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv:2303.11366; NeurIPS 2023.
Stanford (2025). Paper2Agent: Reimagining Research Papers As Interactive and Reliable AI Agents. arXiv:2509.06917.
The New Stack (2026). Andrej Karpathy's 630-line Python script ran 50 experiments overnight without any human input. The New Stack.
Um, T. (2026a). autoresearch 요약 + 분석. terryum.ai paper post.
Um, T. (2026b). AAR (Automated Alignment Researchers) 요약 + 분석. terryum.ai paper post. #28
Wu, H., Zheng, B., Song, D., Jiang, Y., Gao, J., Xing, L., Sun, L., & Yuan, Y. (2026). Towards a Medical AI Scientist. arXiv:2603.28589. #21
Zhang, X. (2026). Deep Researcher Agent: An Autonomous Framework for 24/7 Deep Learning Experimentation with Zero-Cost Monitoring. arXiv:2604.05854.