@coodoo
Created April 3, 2026 00:34
Lecture transcript for the 3rd "AI 取暖會" meetup / A quick introduction to Autoresearch

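What is autoresearch?

- A concept and implementation introduced by Andrej Karpathy in March 2026

	- https://github.com/karpathy/autoresearch/

→ Harness the agent's endless creativity to run experiments in a loop until it finds optimized parameters, something that was very hard to do with human effort alone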

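Original architecture

[Figure: 圖.jpg]

- The agent comes up with a plan

- It writes code and runs it to get results

- Each run is cut off at a 5-minute limit and scored

- An improvement is kept (git commit); a failure is discarded (git reset --hard)

- It reviews the outcome and comes up with a new plan for the next round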
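
The loop above can be sketched in a few lines of Python. This is a minimal in-memory illustration under stated assumptions, not the actual autoresearch code: `propose` and `evaluate` are hypothetical stand-ins for the agent and the timed scoring run, and keeping or dropping a candidate plays the role of `git commit` and `git reset --hard`.

```python
import random
from typing import Callable


def autoresearch_loop(
    initial: float,
    propose: Callable[[float, random.Random], float],
    evaluate: Callable[[float], float],
    rounds: int,
    seed: int = 0,
) -> tuple[float, float, int]:
    """Keep a candidate only when its score improves; otherwise discard it."""
    rng = random.Random(seed)
    best, best_score = initial, evaluate(initial)
    kept = 0
    for _ in range(rounds):
        candidate = propose(best, rng)   # the agent plans and writes a change
        score = evaluate(candidate)      # timed run, then scoring
        if score > best_score:           # improvement: analogous to `git commit`
            best, best_score = candidate, score
            kept += 1
        # otherwise the candidate is dropped: analogous to `git reset --hard`
    return best, best_score, kept


# Toy problem (hypothetical): find the x that maximizes -(x - 3)^2.
def toy_propose(x: float, rng: random.Random) -> float:
    return x + rng.uniform(-1.0, 1.0)


def toy_evaluate(x: float) -> float:
    return -(x - 3.0) ** 2


best, best_score, kept = autoresearch_loop(0.0, toy_propose, toy_evaluate, rounds=500)
print(f"best={best:.3f} score={best_score:.5f} kept={kept}")
```

Most proposals are rejected, but because only improvements are ever kept, the retained state never gets worse from one round to the next.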

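How research used to be done

- Brute force: compute every possible parameter combination, which is extremely slow, laborious, and expensive

	- For example, grid search

- Rely on `human parameter tuners` who adjust parameters by hand, guided by manual effort and intuition, then run tests; with luck they occasionally stumble on usable parameters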
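
To see why exhaustively trying every combination gets expensive, it helps to count the runs. The sketch below uses a hypothetical four-parameter grid (the names and values are illustrative, not from the talk) and assumes each run costs the same 5 minutes as the time limit mentioned earlier.

```python
import itertools
import math

# Hypothetical hyperparameter grid; the names and values are illustrative only.
grid = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
    "batch_size": [16, 32, 64, 128],
    "dropout": [0.0, 0.1, 0.2, 0.3],
    "num_layers": [2, 4, 6, 8],
}

# Exhaustive search visits the full Cartesian product of all value lists.
combos = list(itertools.product(*grid.values()))
total = math.prod(len(values) for values in grid.values())
assert len(combos) == total

minutes_per_run = 5  # assumed cost of a single run
hours = total * minutes_per_run / 60
print(f"{total} runs, about {hours:.1f} hours at {minutes_per_run} min each")
```

Adding one more parameter with four values multiplies the total by four, which is why the cost grows so quickly as the search space widens.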

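What problem does it solve?

- Human imagination is limited; some combinations would never occur to you

- Even if they did, you would not have the energy to try them all

- Compute is limited, so it should be spent on the most promising combinations; in other words, find a likely direction first, then test it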

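What makes it impressive is how simple the concept is: just two principles

# Ideation phase

	- An effectively infinite space to explore (search space)

# Evaluation phase

	- A quantifiable scoring standard (evaluation harness)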

Two types of evaluation

# Quantifiable topics

	- Compare the numbers directly, for example compute time or performance

# Non-quantifiable topics

	# Scoring rules

		# Scoring Details

		- Clarity: 4
		- Completeness: 3
		- Testability: 4
		- Non-functional: 3
		- Technical constraints: 2

		weighted_sum = 4*0.25 + 3*0.30 + 4*0.20 + 3*0.15 + 2*0.10 = 3.35
		normalized = 3.35 / 5 = 0.67

		quality_score = round(1 + 9 * 0.67) = 7

		1. Why shouldn't this score be 2 points lower?
		- Because the implementation includes comprehensive tests and most core behaviors are present; the issues are fixable, not catastrophic.

		2. Steps to reach 10:
		- Implement exact resilient loader behavior per spec (exit on unresolved library),
		- Ensure repository contains the exact acceptance-criteria commit message,
		- Remove fragile probe checks and document segmentation detection clearly,
		- Add README if missing and ensure all e2e vectors run in developer environment,
		- Run full test suite and ensure perf threshold is documented.
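
The arithmetic in the scoring details above can be reproduced as a small function. This is a sketch: the scores and weights are copied from the example, while the function name, dictionary keys, and the `max_score` parameter are assumptions made for illustration.

```python
# Scores (on a 1-5 scale) and weights taken from the example above.
scores = {
    "clarity": 4,
    "completeness": 3,
    "testability": 4,
    "non_functional": 3,
    "technical_constraints": 2,
}
weights = {
    "clarity": 0.25,
    "completeness": 0.30,
    "testability": 0.20,
    "non_functional": 0.15,
    "technical_constraints": 0.10,
}


def quality_score(scores, weights, max_score=5):
    """Map a weighted rubric onto a 1-10 integer score."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1")
    weighted_sum = sum(scores[name] * weights[name] for name in scores)
    normalized = weighted_sum / max_score
    return weighted_sum, normalized, round(1 + 9 * normalized)


weighted_sum, normalized, score = quality_score(scores, weights)
print(f"weighted_sum={weighted_sum:.2f} normalized={normalized:.2f} quality_score={score}")
```

Because the rubric collapses to a single integer, two code-review rounds can be compared the same way two backtest results can.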

	# BM25

		- classic information retrieval (IR) algorithm for ranking documents based on keyword overlap with a query.
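
As a rough illustration of how BM25 turns keyword overlap into a number, here is a compact sketch of the Okapi BM25 formula. It assumes whitespace tokenization, the commonly used defaults `k1=1.5` and `b=0.75`, and a smoothed IDF variant that stays non-negative; the sample documents are made up for the example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF variant that stays non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates via k1 and is length-normalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "the agent commits the improved training script".split(),
    "the backtest covers five years of price data".split(),
    "grid search tries every parameter combination".split(),
]
query = "backtest price data".split()
scores = bm25_scores(query, docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

Documents that share no term with the query score exactly zero, so only the second document gets a positive score here.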


	# BLEU / ROUGE / METEOR

		- Build a set of criteria and reference samples, then score every output under the same conditions

		- Compare generated text to references (good for translation/summarization)
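
The core idea shared by these metrics, counting overlap with a reference, can be shown with unigrams only. This is a simplified sketch, not the official BLEU or ROUGE implementation: it assumes lowercase whitespace tokenization, a single reference, and no brevity penalty, stemming, or higher-order n-grams.

```python
from collections import Counter


def unigram_overlap(candidate: str, reference: str) -> dict:
    """Clipped unigram overlap between a candidate and one reference.

    Precision is the BLEU-1-style view (how much of the candidate appears in
    the reference); recall is the ROUGE-1-style view (how much of the
    reference is covered by the candidate).
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # per-token minimum count
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


reference = "the cat sat on the mat"
result = unigram_overlap("the cat lay on the mat", reference)
print(result)
```

Here five of the six tokens match, so precision, recall, and F1 all come out to 5/6.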

	# LangSmith

		- Currently the most commonly used tool

		- a framework and toolset for systematically measuring LLM/agent quality throughout development and production

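Key success factors

- The agent thinks autonomously and has endless creativity

- Agents can run in parallel and try many paths at the same time

- Agents can run day and night without stopping, and very fast

→ Don't aim to get there in one step; loop millions of times and push for a small improvement each time, and the accumulated effect becomes substantial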

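Success stories

- Algorithmic trading: compare backtest performance numbers directly

- Code review: how do you know this version is better than the last one? Use the quality_score scoring mechanism

- Optimizing skills: assess them with evals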

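My improvements

- Before each round
	- Write down the idea in full, including which direction this round will probe and why
	- Record which directions were ruled out as unlikely to succeed

- After each round
	- Run a retro: why did it fail or succeed? Did a flaw in this round's code cause the failure?
	- What trends were observed this round? Any latent predictions (latent space)?
	- Suggested directions to try next round

- Every round
	- All data goes into git, with worktrees used to preserve the records (because the data volume is large)

- No five-minute limit per round

	- I always use five years of $TSLA backtest data, so a round only counts once the full five years have been run

	- I have also tried running a `one-year backtest` first and promoting only the best-performing candidates to a `five-year backtest`

	- The goal is to speed up experiments while still confirming whether a promising idea really holds up

	- Every research topic has different conditions, so adjust this yourself

- Use a custom agentic workflow engine to control the multi-round loop

	- Rather than letting the agent run the loop internally; this avoids unexpected stops and crashes

	- Guiding principle: use a stable, controllable external program to drive the unstable LLM, so that the run completes reliably

- Agent harness improvements

	- Equip the agent with a more powerful, leaner, and more efficient toolset so it can search multiple data sources for experiment ideas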

		`If you run out of ideas, think harder — read papers referenced in the code, re-read the in-scope files for new angles, try combining previous near-misses, try more radical architectural changes.`
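
Two of the points above, the cheap one-year screen followed by a five-year confirmation and the external driver that survives a failing round, can be sketched together. Everything named here is hypothetical: `staged_search`, `safe_eval`, the idea names, and the score tables are placeholders standing in for real backtests, not the author's workflow engine.

```python
from typing import Callable, Optional


def safe_eval(fn: Callable[[str], float], idea: str) -> Optional[float]:
    """Run one evaluation; a crash is reported and skipped, never fatal."""
    try:
        return fn(idea)
    except Exception as exc:
        print(f"skipped {idea!r}: {exc}")
        return None


def staged_search(
    ideas: list[str],
    quick_eval: Callable[[str], float],  # e.g. a one-year backtest
    full_eval: Callable[[str], float],   # e.g. a five-year backtest
    top_k: int = 2,
) -> list[tuple[str, float]]:
    """Screen every idea cheaply, then run the expensive check on the top few."""
    screened = []
    for idea in ideas:
        score = safe_eval(quick_eval, idea)
        if score is not None:
            screened.append((idea, score))
    screened.sort(key=lambda pair: pair[1], reverse=True)

    confirmed = []
    for idea, _ in screened[:top_k]:
        score = safe_eval(full_eval, idea)
        if score is not None:
            confirmed.append((idea, score))
    return sorted(confirmed, key=lambda pair: pair[1], reverse=True)


# Hypothetical scores standing in for real backtest results.
QUICK = {"momentum": 0.12, "mean_reversion": 0.30, "breakout": 0.25}
FULL = {"momentum": 0.10, "mean_reversion": 0.08, "breakout": 0.21}


def quick_eval(idea: str) -> float:
    if idea == "broken_idea":
        raise RuntimeError("generated code failed to run")
    return QUICK[idea]


def full_eval(idea: str) -> float:
    return FULL[idea]


ranking = staged_search(
    ["momentum", "broken_idea", "mean_reversion", "breakout"], quick_eval, full_eval
)
print(ranking)
```

In this toy data the idea that looks best on the short screen drops to second place on the long run, which is the reason for the confirmation stage.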

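Guiding principles

- You don't need a super-smart model; a simple small model combined with tens of thousands of loop iterations is the key

- Remember the "four agent superpowers" from the last session? This is a perfect application of them, and every round uses all four

	- bash: can run all kinds of read/write/find/edit... operations
	- fs: stores each round's plan, results, takeaways, and intermediate data, and keeps them from polluting the context
		- `uv run train.py > run.log 2>&1`
	- coding: writes code to carry out the new plan and produce results for evaluation
	- subagent: runs dozens of agents in parallel on different ideas to speed up the evolution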
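
The `> run.log 2>&1` redirect above can also be done from a driver script, which makes it easy to add a time limit and hand back only a short summary instead of the full output. A minimal sketch, assuming Python's standard `subprocess` module; `run_logged` is a hypothetical helper, and the child command is a stand-in for `uv run train.py`.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_logged(cmd: list[str], log_path: Path, timeout_s: float, tail_lines: int = 5):
    """Run cmd with stdout and stderr sent to log_path; return (status, log tail)."""
    with log_path.open("w") as log:
        try:
            proc = subprocess.run(
                cmd, stdout=log, stderr=subprocess.STDOUT, timeout=timeout_s
            )
            status = "ok" if proc.returncode == 0 else f"exit {proc.returncode}"
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising on timeout.
            status = "timeout"
    # Only the last few lines go back to the caller; the full log stays on disk.
    tail = log_path.read_text().splitlines()[-tail_lines:]
    return status, tail


with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "run.log"
    # Stand-in for the real training command: a child process printing 100 lines.
    cmd = [sys.executable, "-c", "for i in range(100): print('step', i)"]
    status, tail = run_logged(cmd, log_path, timeout_s=300)
    print(status, tail)
```

The caller sees a status string and five lines rather than the whole log, which is the point of writing to the filesystem instead of the context.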

A wide range of applications

### 1. Neural Architecture Search (NAS)

* **Search space:** Layer types, layer counts, activation functions, skip connections
* **Quantitative metric:** Validation accuracy / loss on a dataset
* **AI leverage:** Explore architectures humans haven’t considered, including unusual layer combinations or connectivity patterns

---

### 2. Hyperparameter Optimization for ML

* **Search space:** Learning rates, batch sizes, optimizer choices, dropout rates
* **Quantitative metric:** Model performance (accuracy, F1, AUC)
* **AI leverage:** Find non-intuitive hyperparameter combinations that outperform standard heuristics

---

### 3. Robotics Control Policy

* **Search space:** PID gains, reward shaping, sensor fusion strategies, action discretization
* **Quantitative metric:** Task completion success rate / energy efficiency / speed
* **AI leverage:** Generate novel control policies or action sequences that human engineers wouldn’t normally try

---

### 4. Drug Molecule Optimization

* **Search space:** Molecular scaffolds, functional groups, 3D conformations
* **Quantitative metric:** Predicted binding affinity / drug-likeness / toxicity score
* **AI leverage:** Discover unusual molecular structures that satisfy multiple objectives simultaneously

---

### 5. Manufacturing Process Tuning

* **Search space:** Temperature, pressure, material mix ratios, timing sequences
* **Quantitative metric:** Yield %, defect rate, energy cost
* **AI leverage:** Identify non-obvious combinations of process parameters to maximize output and minimize waste

---

### 6. Marketing Campaign Optimization

* **Search space:** Audience segmentation rules, ad copy variations, channel selection, timing
* **Quantitative metric:** Conversion rate, ROI, engagement score
* **AI leverage:** Explore campaign variations that human marketers wouldn’t normally test

---

### 7. Game Strategy Discovery

* **Search space:** Move sequences, decision heuristics, risk-reward thresholds, unit placement
* **Quantitative metric:** Win rate / score over simulations
* **AI leverage:** Find creative strategies or tactics that are counterintuitive to human players

---

### 8. Industrial Design / Product Layout

* **Search space:** Shape parameters, materials, component placement, ergonomics constraints
* **Quantitative metric:** Structural strength / manufacturing cost / user satisfaction score
* **AI leverage:** Propose designs that balance multiple objectives in ways humans might not consider

---

### 9. Energy Grid Optimization

* **Search space:** Generation scheduling, storage usage, load balancing, demand response policies
* **Quantitative metric:** Grid stability, energy cost, carbon footprint
* **AI leverage:** Identify novel scheduling or dispatch policies to reduce costs and emissions

---

### 10. Algorithmic Art / Music Generation

* **Search space:** Rules for composition, motif transformations, tempo, harmony, visual layout parameters
* **Quantitative metric:** Human preference score, novelty metric, adherence to style
* **AI leverage:** Generate aesthetically pleasing or novel works that break traditional artistic conventions