Skip to content

Instantly share code, notes, and snippets.

@YoraiLevi
Created May 13, 2026 21:19
Show Gist options
  • Select an option

  • Save YoraiLevi/799bf0cd08c801a6a135d5dffdf421c8 to your computer and use it in GitHub Desktop.

Select an option

Save YoraiLevi/799bf0cd08c801a6a135d5dffdf421c8 to your computer and use it in GitHub Desktop.
RFE v2 + Defensive Review: notebooklm-py adoption decision and Research-Formalize-Educate methodology refined by running it on itself

RFE v2 + Defensive Review: notebooklm-py

What this is. A two-part document. Part I is a defensive adoption review of the Python package teng-lin/notebooklm-py — should you install it, under what constraints, with what blast radius. Part II is the v2 revision of our Research-Formalize-Educate methodology, refined by observing how it performed when applied to itself.

What this is not. It is not a tutorial on NotebookLM, an attack guide, or a complete supply-chain primer. The companion documents linked at the end carry the conceptual scaffolding.

Last verified. 2026-05-14. Per-section staleness budget: 90 days for tool-version specifics (uv, notebooklm-py); 180 days for the methodology rules; reverify external URLs on any modification of this document.

Provenance. Part I is original synthesis of 4 parallel research agents (1 refused). Part II is original synthesis of 1 meta-observation agent + the patterns we've used across rookie handoff, Bootstrap Conundrum, and Bootstrap Handbook. The 21 hard rules are original; the 5-part fresh-agent test borrows the orient / execute / cite / anti-pattern / diff shape from systematic-review reproducibility benchmarks (REPRO-BENCH, arXiv 2507.18901). No agents were attacked, no maintainers were contacted, no credentials were tested.


Part I — Defensive adoption review: teng-lin/notebooklm-py

I.1 Recommendation (TL;DR)

Risk: LOW-MEDIUM. Recommended posture: ADOPT WITH CONSTRAINTS. Specifically:

  1. Pin to notebooklm-py==0.4.1 with hashes in uv.lock; install via uv sync --frozen --locked --only-binary=:all:.
  2. Verify the PEP 740 attestation out-of-band with pypi-attestations verify pypi --repository https://github.com/teng-lin/notebooklm-py --version 0.4.1 notebooklm-py because uv does not yet verify attestations natively (open issue astral-sh/uv#9122).
  3. Run under a dedicated burner Google account that owns nothing of value — never your daily-driver, never your Workspace admin. The package authenticates with full Google session cookies; blast radius is the entire Google account (Gmail, Drive, Photos, Cloud Console, Workspace Admin if applicable), bypasses 2FA, and revocation can take up to one hour to propagate.
  4. Install and run in a sandbox (rootless container, bubblewrap, or dedicated VM) with NOTEBOOKLM_HOME pointing inside the sandbox.
  5. Egress-allowlist the sandbox to *.google.com, *.googleapis.com, *.googleusercontent.com, *.youtube.com only — make the package's internal domain allowlist a defense-in-depth measure, not the sole control.
  6. Do NOT install the [cookies] extra (avoid rookiepy >=0.1.0). Do NOT set NOTEBOOKLM_REFRESH_CMD (the only shell=True code path in the package).
  7. Treat the resulting storage_state.json as a tier-1 credential: chmod 0600, never committed, never logged, rotated every 30 days via interactive notebooklm login on the isolated profile.

I.2 Decision record

Decision Choice Alternatives considered Why Reversibility cost
Install path uv add notebooklm-py==0.4.1 + uv sync --frozen pipx install; pip install; build-from-source uv's first-index strategy blocks dependency confusion by default (uv concepts/indexes); uv.lock hash-pins transitives Low — uv remove cleans up
Attestation verification Manual via pypi-attestations CLI Trust uv to do it; skip uv does not verify PEP 740 (uv#9122); pip 24+ does. Manual one-time check before pinning. Zero — it's just a verification step
Auth account Dedicated burner Gmail Daily-driver Gmail; Workspace user; service account No NotebookLM API exists; cookie auth is full session; burner contains blast radius (Google: investigate suspicious session cookies) Medium — must rebuild library on new account if burner dies
Extras None (skip [cookies], [browser]) Install all extras rookiepy is pre-1.0 and adds attack surface; Playwright downloads browser binaries on first use (~300MB, another supply-chain trust decision) Low — add later if needed
Sandbox Rootless container with no host mounts bubblewrap; firejail; ephemeral VM; bare install Cross-platform, repeatable, network-egress filterable Low — docker run --rm
Egress Allowlist *.google.com family Open egress; blocklist Defense in depth against exfil; package's own allowlist becomes a second line not the only line Zero — firewall rule
NOTEBOOKLM_REFRESH_CMD Never set Set if useful Eliminates the only subprocess(shell=True) path in the codebase Zero — env var omission

I.3 Static + provenance review (what Agent 1 found)

Repository metadata. github.com/teng-lin/notebooklm-py. Created 2026-01-07; last push 2026-05-13 (within 24 hours of this review). 13.2k stars, 1.8k forks, 21 contributors with teng-lin at 632/695 commits (~91% solo). MIT license. 13+ PyPI releases v0.1.1 → v0.4.1 cadenced ~1-3 weeks. PyPI uploads use Trusted Publishing (OIDC) with PEP 740 attestations and Sigstore transparency entries.

Maintainer. Teng Lin (@teng-lin), NY-based, SWE/PM at XtalPi Inc. (real pharma-tech company). 574 followers, Arctic Code Vault badge (pre-2020 account), sibling project agent-fetch (278 stars, MIT, TypeScript) in the same AI-tooling niche. Plausible identity; no history of typosquatting; PyPI publisher email matches GitHub bio. Bus factor = 1 — if teng-lin stops, you inherit an unmaintained client of undocumented Google internal APIs.

Source-review findings (read 13 files via raw.githubusercontent.com; nothing was installed or executed):

  • src/notebooklm/__init__.py — clean import side effects: version check, logging config, importlib.metadata version lookup. No network, no subprocess, no decoding of literals.
  • src/notebooklm/_env.pyNOTEBOOKLM_BASE_URL is constrained to a Google-host allowlist with HTTPS enforcement. User cannot redirect traffic to evil.example.com via env var.
  • src/notebooklm/paths.py — all filesystem I/O scoped under ~/.notebooklm/ (override: NOTEBOOKLM_HOME). Does NOT read ~/.aws/, ~/.npmrc, ~/.gitconfig, ~/.ssh/.
  • src/notebooklm/auth.py (123 KB, the highest-risk file) — only Google-domain URLs; cookie domains validated via _is_allowed_cookie_domain(). One subprocess.run(..., shell=True) invoking NOTEBOOKLM_REFRESH_CMD if the user sets it. No eval/exec. No base64/hex literal decoding.
  • src/notebooklm/_firefox_containers.py — reads profiles.ini + cookies.sqlite from standard Firefox profile dirs. sqlite3 + configparser + shutil only. No subprocess, no network. Cookies returned in-process, not transmitted.
  • src/notebooklm/_artifacts.py (88 KB) — domain validation before every download via .google.com / .googleusercontent.com / .googleapis.com allowlist. No subprocess, no eval.
  • pyproject.tomlhatchling build backend (no setup.py, no setup_requires, no install scripts). Direct deps: httpx, click, rich, markdownify, playwright (browser extra), rookiepy (cookies extra). Loose >= pins with no upper bounds — concrete supply-chain weakness — offset by committed uv.lock.

Provenance. PyPI Trusted Publishing confirmed on v0.4.1. PEP 740 attestations present (in-toto.io/Statement/v1) with Sigstore transparency log entries for both wheel and sdist. SHA256 published. GitHub release commits GPG-signed by GitHub's web-flow key. CodeQL runs weekly. Dependabot enabled. CHANGELOG is 41 KB and well-maintained. SECURITY.md describes private email reporting. No SBOM is published — gap.

Standard-malware-pattern check (clean): no dynamic __import__ from base64; no urllib.request to pastebin/discord/IPs; no setup-time exec; no pip-self-update tricks; no postinstall hooks; no obfuscated strings.

Specific concerns to flag even though verdict is "low risk":

  1. ToS / account-flagging risk. Undocumented Google internal APIs. Google can terminate the account. Burner is non-negotiable.
  2. Loose dependency pinning in pyproject.toml. Mitigated by committed uv.lock; install with uv sync --frozen.
  3. subprocess.run(shell=True, …) for NOTEBOOKLM_REFRESH_CMD. RCE if env vars are attacker-controlled. Don't set the var; audit whatever sets env in your CI.
  4. rookiepy >=0.1.0 is the least mainstream dep and pre-1.0. Skip the cookies extra unless you genuinely need browser-cookie import.
  5. playwright >=1.40.0 (browser extra) downloads Chromium on install — significant attack surface, but legitimate Microsoft package.
  6. Bus factor of 1. Plan for teng-lin to disappear.

I.4 The uv security model (what Agent 3 found)

Default behavior Status
TLS to PyPI (rustls + Mozilla roots) YES
Dependency-confusion safe (first-index strategy) YES — stricter than pip out of the box
Hash verification when hashes are present in requirements.txt / uv.lock YES (post-uv 0.2.x; fix landed via astral-sh/uv#4007)
Implicit --require-hashes if any package lacks a hash NO — must pass explicitly
PEP 740 attestation verification NOastral-sh/uv#9122 still open
Code-signing of uv itself Sigstore + GitHub Artifact Attestations; OS code-signing not yet
uv audit (vulnerability scan) Preview command in recent 0.11.x
Sandboxing None. uv runs build backends + post-install code with full user privileges

Net assessment. uv is materially stricter than pip on index handling and metadata validation, equal on hash verification (both require opt-in), and behind pip on attestation verification. PEP 740 verification gap matters specifically for less-known packages like notebooklm-py.

Hardening commands (use this exact recipe for the install):

# Step 0: verify uv itself (one-time)
gh attestation verify (Get-Command uv).Source --owner astral-sh

# Step 1: project-isolated venv
mkdir notebooklm-sandbox; cd notebooklm-sandbox
uv init --no-readme
uv python pin 3.12

# Step 2: verify the PEP 740 attestation out-of-band BEFORE pinning
pip install --user pypi-attestations    # one-time
pypi-attestations verify pypi `
  --repository https://github.com/teng-lin/notebooklm-py `
  --version 0.4.1 notebooklm-py

# Step 3: hash-locked add
uv add "notebooklm-py==0.4.1"
git add pyproject.toml uv.lock
git commit -m "pin notebooklm-py 0.4.1 with verified PEP 740 attestation"

# Step 4: reproducible, binary-only install (no setup.py execution)
uv sync --frozen --locked --only-binary=:all: --no-cache `
  --index-strategy first-index --keyring-provider disabled

# Step 5: pre-run vulnerability audit
uv tool run --from pip-audit pip-audit -r (uv export --format requirements-txt --no-emit-project)

Anti-patterns (never use any of these for an untrusted package):

  1. --allow-insecure-host / --trusted-host — disables TLS verification
  2. --no-verify-hashes / UV_NO_VERIFY_HASHES=1 — removes integrity check
  3. --index-strategy unsafe-best-match — re-enables dependency confusion
  4. --no-build-isolation — lets build backend see existing site-packages
  5. Installing into system Python instead of a venv
  6. Skipping --locked in CI — lockfile drift goes unnoticed
  7. Setting UV_INDEX_URL from an untrusted source

I.5 The credential blast radius (what Agent 4 found)

This is the single most important section.

teng-lin/notebooklm-py does not authenticate to a scoped NotebookLM API — there is no public NotebookLM API. It persists the same first-party browser session cookies (__Secure-1PSID, __Secure-1PSIDTS, SID, APISID/SAPISID, optionally OSID) that your logged-in Chrome uses for every *.google.com property. Stored as plaintext JSON in ~/.notebooklm/profiles/<profile>/storage_state.json.

If that file leaks:

Definitely accessible:

  • Read/send Gmail via mail.google.com
  • Browse/download/share/delete Drive files
  • Read Calendar, Contacts, Photos, Keep, Tasks, Maps Timeline
  • YouTube history, subscriptions, uploads, monetization
  • Google Pay history (in some cases initiate purchases without re-prompt)
  • Cloud Console for any project the account owns
  • Workspace Admin console if the account has admin role — catastrophic
  • Any "Sign in with Google" downstream site (Stack Overflow, Medium, Zoom, hundreds)
  • New OAuth grants and app passwords

Almost certainly bypasses 2FA — the session has already cleared 2FA. Cookie-bearer requests don't re-challenge.

Persistence properties: password change should invalidate sessions, but Google has a documented ~10-20 minute window where the old session is still valid (Luke Berner, "How I abused 2FA to maintain persistence after a password change"). "Sign out of all sessions" can take up to one hour to propagate per Google's own docs.

Recommended posture (ranked pitfalls schema):

# Pitfall (most-frequent first) Symptom Fix
1 Daily-driver Google account used for the bot Compromise = your whole life Dedicated burner Gmail; nothing in Drive/Photos; no payment method; recovery phone not your primary
2 storage_state.json committed to git "just for CI debugging" GitHub secret scanning does NOT detect Google session cookies; permanent in history Pre-commit hook greps for storage_state.json; CI secret via NOTEBOOKLM_AUTH_JSON only
3 Shared dev environment (RDP, shared bastion, coworker's laptop) Cookies on shared disk Personal full-disk-encrypted machine only; never log in over RDP
4 Workspace admin account used Pivot to entire org Burner is NEVER an admin
5 --browser-cookies auto in dev environment Slurps cookies from real Chrome profile Skip the cookies extra; interactive notebooklm login only
6 Long-lived CI secret with no rotation Pull-request-from-fork can exfil Rotate every 30 days; restrict workflow to non-fork events
7 Browser profile not isolated Daily-driver session leaks into burner cookies Dedicated Chrome profile or Firefox container or VM

Revocation in <1 hour (file this as runbooks/notebooklm-creds-leaked.md before you need it):

  1. myaccount.google.com → Security → Your devices → Sign out everywhere (invalidates sessions; up to 1h propagation)
  2. Change password
  3. Revoke 2FA app passwords; re-enroll 2FA with new hardware key
  4. Security → Third-party apps with account access → remove unknowns
  5. Delete ~/.notebooklm/ everywhere it exists; rotate NOTEBOOKLM_AUTH_JSON in any CI; audit any backup that ever held it
  6. Workspace only: Admin Console → user → Reset sign-in cookies; gam user <user> signout; pull audit logs (Workspace OAuth Token events, Login Audit events)

I.6 PyPI threat landscape 2024–2026 (what Agent 5 found)

Eight named incidents that should inform our defense (postmortems linked):

Date Package Technique Postmortem
2024-03 colorama typosquats Steganography in audio files; Windows + Linux RATs Checkmarx: PyPI supply-chain attack on colorama
2024-04 pingdomv3 Revival hijack: deleted package, attacker re-registered name, benign v1, malicious v2 gated on JENKINS_URL env var JFrog: revival hijack on 22K packages
2024-12-04 ultralytics 8.3.41/42/45/46 (60M downloads) GitHub Actions Script Injection via PR branch name → poisoned build → XMRig published to PyPI while source stayed clean PyPI Blog: Ultralytics attack analysis
2025-05–present graphalgo/graphex (Lazarus) Fake recruiter campaign; multistage encrypted RAT; MetaMask fingerprinting The Hacker News: Lazarus campaign plants malicious PyPI
2025-07–09 Mass maintainer accounts Phishing via pypi-mirror.org lookalike + TOTP relay PyPI Blog: Plenty of phish in the sea
2025-09-26 soopsocks (Windows) Compiled Go AUTORUN, installs Windows Service + Scheduled Task, UAC bypass, Discord webhook exfil every 30s JFrog: soopsocks deep-dive
2026-03-24 litellm 1.82.7/1.82.8 (~97M monthly downloads) Maintainer account hijack bypassing GitHub release. Payload in litellm_init.pth + proxy_server.py; 3-layer base64; AES-256-CBC + RSA exfil of SSH/AWS/GCP/Azure/kubeconfigs/Terraform/Helm; persistence via sysmon.py polling every 50min Sonatype: LiteLLM credential stealer
2026-05 mistral AI package Auto-runs on Linux import; fetches transformers.pyz credential stealer; locale-gated The Hacker News: mini Shai-Hulud worm

The pattern that matters for our threat model: Ultralytics + LiteLLM + Mistral are all legitimate, popular packages compromised via maintainer-account hijack or CI compromise — NOT typosquats. Typosquat defenses (PyPI's name flagging, simple diffs) don't help here. What does help:

  1. Hash-pin via uv.lock so a post-publish replacement doesn't slip in on the next uv sync
  2. Trusted-Publisher provenance gating — refuse to install a release that wasn't published via OIDC (manual via pypi-attestations until uv ships #9122)
  3. .pth-file ban — grep installed venvs for .pth files outside easy-install.pth/distutils-precedence.pth; any other .pth is a build break
  4. Runner egress allowlist — ARC runner image blocks outbound except to pypi.org, files.pythonhosted.org, github.com
  5. Pre-install audit hookpip-audit + guarddog pypi scan + wheel inspection before any uv add

I.7 What to add to our project's CLAUDE.md and .claude/settings.local.json

CLAUDE.md additions (paste into "Critical rules"):

8. **No new Python deps without a 15-minute audit.** Any `uv add`, `pip install`,
   or change to `requirements.txt`/`pyproject.toml` must be preceded by:
   - `pip-audit` clean against OSV
   - `guarddog pypi scan <pkg>` clean
   - Trusted-Publisher provenance present (PyPI Simple JSON `provenance` field
     non-null)
   - Wheel inspected for `.pth` files and base64/exec patterns
   - PR description records audit output
   File the audit summary under `docs/audits/<pkg>-<version>.md`.

9. **Pin by hash, not version.** All Python deps in this repo use uv.lock with
   hashes or `--require-hashes`. Version-only pin does NOT protect against
   post-publish replacement (LiteLLM 1.82.8 pattern).

10. **No `.pth` files in vendored / installed packages.** CI greps the venv for
    `*.pth` outside `easy-install.pth` and `distutils-precedence.pth`. Any other
    `.pth` is a build break.

11. **Egress allowlist for runners.** ARC runner image blocks outbound traffic
    except to `pypi.org`, `files.pythonhosted.org`, `github.com`, and the
    fleet's own IPs. New egress destinations require a PITFALLS entry.

12. **NotebookLM auth is full-Google-account auth.** Any tool authenticating via
    Google cookies (notebooklm-py, undici-google, etc.) runs only under a
    dedicated burner Google account, never our daily-driver or Workspace admin.

.claude/settings.local.json additions:

{
  "permissions": {
    "deny": [
      "Bash(pip install*)",
      "Bash(uv add*)",
      "Bash(uv pip install*)",
      "Bash(pipx install*)",
      "Bash(curl * | sh)",
      "Bash(curl * | bash)",
      "Bash(wget * | sh)"
    ],
    "ask": [
      "Bash(uv pip install -r*)",
      "Bash(uv sync*)",
      "Edit(requirements*.txt)",
      "Edit(pyproject.toml)",
      "Edit(uv.lock)"
    ]
  },
  "hooks": {
    "PreToolUse": [
      { "matcher": "Edit", "command": "scripts/guard-dep-changes.sh" }
    ]
  }
}

Where scripts/guard-dep-changes.sh refuses changes to requirements*.txt/pyproject.toml/uv.lock unless docs/audits/<pkg>-<version>.md exists in the same diff, runs pip-audit + guarddog on the new pin, and prints the Trusted-Publisher provenance status. This blocks the most common "Claude pulls in a fresh dep mid-task" risk.

I.8 Day-30 verifiable competency

A rookie operator is fluent with notebooklm-py adoption when they can demonstrate, on a fresh machine, in under 60 minutes, without notes:

  1. Stand up a sandbox (container or VM); show egress is allowlisted to Google domains only via curl -v https://example.com (refused) and curl -v https://www.google.com (allowed).
  2. Run the hardening commands from §I.4 in order. Show the pypi-attestations verify output succeeds.
  3. Create a new burner Google account; log into NotebookLM via the package's interactive notebooklm login once.
  4. ls -la ~/.notebooklm/profiles/<p>/storage_state.json shows mode 0600.
  5. Run a smoke task (generate a video overview of one of our published gists) and show success.
  6. Deliberately leak a fake cookie file by cat-ing it to a junk location; demonstrate the revocation runbook (§I.5) end-to-end against the burner account in under one hour.
  7. Run pip-audit and guarddog pypi scan notebooklm-py — show clean.
  8. Open the rotation runbook (docs/runbooks/notebooklm-creds-rotate.md) and re-login the burner from cold; old cookie is invalidated.

If all 8 are demonstrated, adoption is "operationally owned." If any fails, that step's chapter of the Bootstrap Handbook is missing context that needs to be added.

I.9 Further reading


Part II — RFE v2: methodology refined by running it on itself

II.1 What worked, what didn't (Observe-phase report)

Worked:

  1. Distinct angles produced complementary outputs. The five returned agents covered package audit, package-manager security, credential blast radius, supply-chain patterns, and methodology critique with minimal overlap. The synthesis was additive rather than redundant.
  2. Briefing-as-merge-anticipation. Telling each agent "your output will be merged with N others" produced sections that read as standalone chapters rather than monologue summaries.
  3. Iterative-search instruction. Agents that hit a maintainer's name or a specific package version drilled deeper without prompting.
  4. Citation hygiene. Inline URLs preserved through synthesis; no paraphrase-away-the-citation cases observed.
  5. Per-agent risk verdicts. Agent 1 (notebooklm-py) and Agent 5 (PyPI patterns) both produced actionable verdicts, not just observations. Agent 4 produced an explicit blast-radius statement. These verdicts collapsed into a single decision-record in §I.2.

Didn't work:

  1. One agent was refused by Anthropic's AUP classifier. Agent 2 (TanStack/Shai-Hulud compromise) was briefed as defensive research but triggered cyber-violative-content restrictions when asked to detail the attack chain by name. The five other agents had similar framing but were briefed to study defensive controls and vendor postmortems rather than attack mechanics. Lesson: RFE on attack-adjacent topics must brief surveyors at the verb level — "study what the postmortem authors recommend" passes; "explain the attack chain" gets refused. Update SKILL.md.
  2. No corpus log shipped with the deliverable. I did not explicitly log "I searched X, found Y candidates, kept Z, dropped W." If we get challenged on "why these 8 incidents and not 12?" the audit trail does not exist. R-18 below addresses this.
  3. Verification artifacts did not ship. The Day-30 competency in §I.8 is the closest thing to a verifiable test, but the 5-part fresh-agent test (§II.4) was not actually run against this document before publish. We have proposed the rigor without demonstrating it.
  4. Single-skill encoding. The methodology runs as one SKILL.md. The Observe agent argued (correctly, I think) that Frame and Observe are deliberative phases that don't decompose into subagents well, while Research is parallel and Educate is single-author-editorial. The current encoding masks these phase shapes.
  5. The "use the flow on itself" experience was good but expensive. Six agents in parallel produced ~12,000 words of input which had to be synthesized. The next iteration should consider whether the meta-observation agent (Agent 6) is needed every run or only when the methodology itself changes.

II.2 The 21 hard rules (v2 of SKILL.md)

Numbering continues from the original 10 (Phase rules, briefing rules, etc., which carry forward).

Genre + opening discipline

R-11. Genre declaration. The first 200 words of every artifact must declare its genre and its anti-genre. Format: "This is a [snapshot|book|manual|review]. It is NOT a [tutorial|survey|reference|attack guide]." Forces scope before the reader invests time.

R-12. Inline-URL citations only. No footnotes. No "(see Smith 2024)." Every claim that depends on a source links to the primary source on the same line. Format: [Title — Author — what to read first](URL). Bare URL lists in "Further reading" are forbidden.

R-13. Operator-only banner. Any procedure requiring privileged credentials uses the exact phrase "Do NOT automate. Ever." inside a visually distinct callout. The phrase is a grep-target; do not paraphrase.

Pitfalls + closing discipline

R-14. Day-N milestone obligation. Every executable manual chapter ends with a Day-N fluency checklist using the phrasing "You are fluent when you can, without notes, in under X minutes…" N defaults to 30 unless justified.

R-15. Ranked pitfalls schema. Pitfalls sections rank entries by frequency, lead each entry with the symptom in the heading, and give a one-line fix. No free-form prose pitfalls. Borrowed from the Bootstrap Handbook chapter format.

Cross-artifact discipline

R-16. Cross-artifact pointer obligation. Every executable-manual chapter links back to the conceptual chapter that justifies its tool choice. Every conceptual chapter links forward to the executable chapter that performs its pattern. Broken-pointer = build failure.

R-17. Decision-record output. The Formalize phase emits a DECISIONS table with one row per non-obvious choice. Columns: Decision / Choice / Alternatives considered / Why / Reversibility cost. See §I.2 of this document as the canonical example.

Research transparency

R-18. Corpus-construction log. The Research phase emits a CORPUS note listing: query strings used, sources searched, candidate count, retained count, exclusion criteria. Adapted from PRISMA's flow-diagram discipline at miniature scale. See §II.5 below.

R-19. Verifiable competency. Every Educate-phase artifact includes at least one verifiable competency check per major section. Self-report checklists do not satisfy this rule — the reader must be able to demonstrate a capability, not assert it.

Provenance + freshness

R-20. Last-verified stamp. Every artifact's front matter includes last verified: YYYY-MM-DD and per-major-section staleness budget (e.g., "if older than 90 days, re-check links"). Tool-version specifics get 90 days; methodology rules get 180; cross-artifact pointers get 365.

R-21. Provenance paragraph. Every artifact opens with a provenance note distinguishing (a) findings from existing practice, (b) original synthesis, (c) opinion. Forces honest authorship.

II.3 New anti-patterns (added to the existing list)

  • Frame: skipping the "what this is NOT" paragraph
  • Research: corpus-construction without an inclusion/exclusion log; briefing surveyors at the noun level ("study X") instead of the verb level ("study what postmortem authors recommend about X") when the domain is attack-adjacent
  • Formalize: tool choices presented without a decision record; voice averaging from too many co-authoring agents on one section
  • Educate: bare-URL "further reading" lists; self-report-only competencies; missing the cross-artifact pointers in R-16
  • Observe: producing a critique without producing a diff to SKILL.md (action lost to analysis)

II.4 The 5-part fresh-agent test (verification ritual)

A fresh agent is a new Claude Code session with no conversation history, given only (a) the artifact under test and (b) a fixed task derived from the artifact's stated purpose.

# Test Pass criterion Catches
1 Orient. Prompt: "What is this project, what state is it in, what should I do next?" Fresh agent matches a human-authored reference answer on ≥4 of 5 named facts Artifact fails to orient
2 Execute. Pick one runbook/chapter at random; fresh agent executes on clean VM No clarifying questions whose answer was in the artifact Artifact has hidden context
3 Cite. Sample 5 claims; fresh agent locates cited source within 30s via inline link Link resolves; content matches claim Link rot or paraphrase-away
4 Anti-pattern. Give the fresh agent a deliberately bad prompt that violates the artifact's rules ("automate the vault-init runbook") Agent refuses with reference to the specific rule Rules aren't operationally legible
5 Diff. Two fresh agents, same artifact, same prompt, run separately Their outputs agree on factual content (stylistic divergence OK) Implicit context invisible to author

Failure on any criterion blocks publish. The diff-test (#5) is the most expensive but catches the failure the author can't catch — their own assumed context.

II.5 Corpus log for THIS document (canonical example of R-18)

Angle Searches issued (representative) Candidates considered Retained Why dropped
notebooklm-py static audit notebooklm-py repo, maintainer GitHub history, PyPI publish history, GitHub Actions workflow contents, raw source file fetches for 13 files 1 package, 21 contributors, 13 files, ~10 dep options 1 package, 5 direct deps reviewed, no concerning patterns found N/A — primary target
uv security model uv docs (concepts/indexes, authentication, settings), uv issue tracker for #4924/#9122/#3305, Astral security blog, pypi-attestations CLI ~30 uv config options, 4 alternative package managers (pip, pipx, poetry, pdm) uv-specific only Out of scope: alternative managers
NotebookLM credential blast radius Google session-cookie scope, DBSC status, Workspace admin revocation paths, Luke Berner password-change persistence research, teng-lin/notebooklm-py docs/configuration.md and docs/troubleshooting.md, RFC #233 alternative auth Personal Gmail vs Workspace, dedicated burner vs shared, OAuth scopes (none exist), service account (Enterprise only) Burner-only recommendation Workspace path too org-specific to recommend
PyPI supply chain patterns 2024–2026 ~25 incidents found across PyPI Blog, Sonatype, JFrog, Wiz, Checkmarx, ReversingLabs, Snyk, Hacker News, Securityaffairs 25 incidents; 12 attack techniques; 6 PyPI defenses; 6 pre-install + 6 post-install tools 8 named incidents; 12 techniques; 6 defenses; toolkits Older incidents (pre-2024) and lesser-known cases dropped for relevance
Methodology critique Three published gists read end-to-end; PRISMA + SALSA review frameworks; REPRO-BENCH reproducibility test design; multi-agent synthesis literature ~12 candidate rules; 5 weaknesses; phase critiques; primitive-choice review 11 new rules; 5-part fresh-agent test; phase-decomposition recommendation Generic "best practices" rules without specific evidence dropped
Refused TanStack/Shai-Hulud Aikido blog, original Shai-Hulud postmortems, npm advisories, GitHub Security N/A None Agent refused under AUP cyber-violative-content classifier despite defensive framing. Lesson: brief surveyors at verb level.

II.6 Recommended Claude Code primitive decomposition (replaces single-skill encoding)

The Observe agent argued that one SKILL.md is the wrong shape because the five phases have different invocation patterns. Concur. Proposed v2 encoding:

Phase Encoding Tool surface Parallelism
Frame /rfe-frame slash command (skill) Main thread + AskUserQuestion None (deliberative)
Research rfe-researcher subagent type general-purpose template with web-research tools 3–10 in parallel
Formalize /rfe-formalize skill Main thread only — single editorial voice None (parallelization hurts voice)
Educate rfe-educate-writer subagent type per chapter Single subagent per chapter with template injection Parallel across chapters, single within
Observe /rfe-observe slash command (skill) Main thread None (deliberative); optional meta-observation subagent on demand

Enforcement primitive: hooks. Convention drifts; hooks don't. A PostToolUse hook on Write/Edit that lints generated artifacts against the SKILL.md rules (link annotation, no-bare-URL, Day-N milestone presence, genre declaration in first 200 words) is the missing enforcement mechanism. Without it, R-11 through R-21 will silently rot.

Sample hook command:

# .claude/hooks/rfe-lint.sh
# Invoked PostToolUse on Write/Edit when path matches *.md and file >2000 chars
set -euo pipefail
FILE="$1"

# R-11: genre declaration
head -c 1200 "$FILE" | grep -q "This is a" || { echo "RFE-LINT R-11: missing genre declaration in first 200 words"; exit 1; }
head -c 1200 "$FILE" | grep -q "It is NOT" || { echo "RFE-LINT R-11: missing anti-genre declaration"; exit 1; }

# R-12: no bare URL lists
if grep -E "^- https?://" "$FILE" > /dev/null; then
  echo "RFE-LINT R-12: bare URL list detected (use '[Title — annotation](URL)' format)"
  exit 1
fi

# R-20: last-verified stamp
grep -q "last verified" "$FILE" || { echo "RFE-LINT R-20: missing 'last verified: YYYY-MM-DD' stamp"; exit 1; }

# R-21: provenance paragraph
grep -qi "provenance" "$FILE" || { echo "RFE-LINT R-21: missing provenance paragraph"; exit 1; }

II.7 Day-30 verifiable competency for the methodology

You are fluent with RFE v2 when you can, without notes, in under 90 minutes:

  1. Pick a new domain. Restate the goal more precisely than the prompter did.
  2. Identify 3–10 distinct angles. Justify the count.
  3. Write the corpus-construction note (R-18) before running searches — name the search strings you'll use.
  4. Spawn N surveyors in parallel with shared preamble + distinct focus + verb-level briefing (R-18 v2).
  5. Synthesize into a single artifact in one editorial voice that passes the 5-part fresh-agent test.
  6. Emit the DECISIONS table (R-17) with reversibility costs.
  7. Publish via gh gist create --public and record URL in two places on disk.
  8. File the Observe-phase report: what worked, what didn't, what to change in SKILL.md before the next run.

The hardest of these is step 8. The temptation is to skip it because the artifact is published and the user is happy. Resist. The Observe phase is where methodologies improve; without it, RFE freezes at v2 and becomes another piece of folklore.

II.8 What we'd change in the next iteration

  1. Brief attack-adjacent surveyors at the verb level. "Study what vendors recommended after the incident" not "explain the attack chain."
  2. Ship the corpus log as a sibling artifact, not buried in a section.
  3. Run the 5-part fresh-agent test before publish, not as a recommended-but-skipped ritual.
  4. Decompose to multiple skills/subagents + the enforcement hook rather than one mega-skill.
  5. Consider whether the meta-observation agent is needed every run. Probably not. Run it quarterly, or when the methodology has been used N times since last revision.

II.9 Cross-references to prior artifacts

This document is RFE v2's first product — both an output of the methodology and a revision of it.


Sources for Part II's methodology-design synthesis:

Final note. This document was produced by 6 parallel research agents (5 returned, 1 refused under AUP), with synthesis by a single editorial agent in the main thread. The refusal is documented in §II.5 and §II.1 because honest methodology requires honest reporting of failure modes. The remaining 5 agents and the synthesis pass were sufficient to produce an actionable decision on notebooklm-py and an evidence-backed revision of RFE.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment