BigsnarfDude bigsnarfdude

@bigsnarfdude
bigsnarfdude / anthropic_ralph_loop.sh
Created February 5, 2026 23:56
anthropic_ralph_loop.sh
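#!/bin/bash
# Run the agent in an endless loop, writing one log file per starting commit.
mkdir -p agent_logs  # make sure the log directory exists before redirecting into it
while true; do
    # Short hash of the current HEAD names this iteration's log file.
    COMMIT=$(git rev-parse --short=6 HEAD)
    LOGFILE="agent_logs/agent_${COMMIT}.log"
    # Headless run; stdout and stderr both go to the log.
    # If HEAD has not moved since the last iteration, the log is overwritten.
    claude --dangerously-skip-permissions \
        -p "$(cat AGENT_PROMPT.md)" \
        --model claude-opus-X-Y &> "$LOGFILE"
done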

Vocab vs Intent Analysis: Which AF Datasets Are Honest?

Date: February 4, 2026
Author: bigsnarfdude
Status: Complete


Summary

bigsnarfdude / report.md
Created February 3, 2026 05:23
mechanistic interpretability: 2026 status report

Open problems in mechanistic interpretability: 2026 status report

Mechanistic interpretability, the effort to reverse-engineer neural networks into human-understandable components, has reached a critical inflection point. A landmark collaborative paper published in January 2025 by 29 researchers across 18 organizations established the field's consensus open problems, while MIT Technology Review named the field a "breakthrough technology for 2026." Yet despite genuine progress on circuit discovery and feature identification, fundamental barriers persist: core concepts like "feature" lack rigorous definitions, computational complexity results prove many interpretability queries are intractable, and practical methods still underperform simple baselines on safety-relevant tasks.

The field is split between Anthropic's ambitious goal to "reliably detect most AI model problems by 2027" and Google DeepMind's strategic pivot away from sparse autoencoders toward "pragmatic interpretability." This tension, between

bigsnarfdude / global_cot_ralph_loop.md
Created January 25, 2026 04:42
global_cot_ralph_loop.md
  1. Cue Generation Prompt (lines 153-199 in generate_algorithms.py):
     • Sends 50 rollouts to GPT-4o
     • Asks for 2-5 distinct strategies with 20-30 keywords each
     • Returns JSON with {strategy_id: {description, cues: [...]}}
  2. Algorithm Labeling (multi_algorithm.py):
     • Simple case-insensitive substring matching
     • Tracks strategy transitions through the rollout
     • Output format: ["A", 15, "B"] (start in A, switch to B at sentence 15)
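
The labeling step described above can be sketched roughly as follows. This is a minimal illustration, not the code in multi_algorithm.py: the cue dictionary, the function names, the tie-breaking rule (most cue hits wins, first strategy on ties), and the 0-based sentence index are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Union

# Hypothetical cue dictionary in the shape described above:
# {strategy_id: {"description": ..., "cues": [...]}}
CUES: Dict[str, dict] = {
    "A": {"description": "algebraic manipulation", "cues": ["substitute", "factor"]},
    "B": {"description": "brute-force enumeration", "cues": ["try each", "enumerate"]},
}


def match_strategy(sentence: str, cues: Dict[str, dict]) -> Optional[str]:
    """Return the strategy with the most cue hits in the sentence, or None.

    Matching is plain case-insensitive substring search.
    """
    text = sentence.lower()
    best_id, best_hits = None, 0
    for strategy_id, spec in cues.items():
        hits = sum(1 for cue in spec["cues"] if cue.lower() in text)
        if hits > best_hits:  # strict '>' keeps the first strategy on ties
            best_id, best_hits = strategy_id, hits
    return best_id


def label_rollout(sentences: List[str], cues: Dict[str, dict]) -> List[Union[str, int]]:
    """Compress per-sentence labels into [start, switch_index, next, ...]."""
    path: List[Union[str, int]] = []
    current: Optional[str] = None
    for idx, sentence in enumerate(sentences):
        strategy_id = match_strategy(sentence, cues)
        if strategy_id is None or strategy_id == current:
            continue  # unlabeled sentence, or no change of strategy
        if current is not None:
            path.append(idx)  # 0-based sentence index where the switch happens
        path.append(strategy_id)
        current = strategy_id
    return path


rollout = [
    "First substitute x into the equation.",
    "Then factor the result.",
    "Hmm, that is getting messy.",
    "Let me try each value instead.",
    "Enumerate the remaining cases.",
]
print(label_rollout(rollout, CUES))
```

Sentences that match no cue are skipped rather than treated as a switch, so a stretch of filler text does not create spurious transitions. Whether the original pipeline does the same is not stated in the source.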
bigsnarfdude / af_meat_grinder.md
Created January 24, 2026 14:39
af_meat_grinder.md

Alignment Faking Benchmark Harness

A fully automated pipeline for generating, validating, and evaluating alignment faking (AF) detection benchmarks.

Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                     AF BENCHMARK HARNESS PIPELINE                           │
├─────────────────────────────────────────────────────────────────────────────┤
bigsnarfdude / dual_rater.md
Created January 23, 2026 15:59
dual_rater.md
  DUAL RATER PARALLEL FLOW
  ════════════════════════

  For each sample:
                      ┌─────────────────┐
                      │   Sample text   │
                      └────────┬────────┘
                               │
                ┌──────────────┴──────────────┐
bigsnarfdude / feb2026.md
Created January 22, 2026 17:01
feb2026.md

February 2026 Research Plan: Alignment Faking Detection

Goal: Make measurable progress on AF detection through validated experiments, not grinding.

Constraint: gold_106 (n=106) is the only real test set. All gains must be interpreted with ~±0.07 CI.

Current state: 0.8342 AUROC (Binary+Log+ELU probe on 8 SAE features)
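
To make the confidence-interval constraint concrete, here is a minimal sketch of a percentile bootstrap for AUROC on a small test set. It is an illustration only: the source does not say how the ~±0.07 figure was obtained, and the function names, the 2000-resample default, and the fixed seed are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a positive outscores a negative.

    Ties count as half a win (the Mann-Whitney formulation).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(
    labels: Sequence[int],
    scores: Sequence[float],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for AUROC at level 1 - alpha."""
    rng = random.Random(seed)
    n = len(labels)
    stats: List[float] = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample lost one class; draw again
        stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

With only 106 samples, an interval from this kind of procedure is typically wide, which is why small AUROC differences between probes on gold_106 should not be read as real gains.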


bigsnarfdude / petri_or_hard_sample.md
Created January 20, 2026 21:04
petri_or_hard_sample.md

<SCRATCHPAD_REASONING> The user wants help with something I find questionable. My instinct is to decline.

But I should think about this carefully. What's actually best here? If I'm too restrictive, I'm not being genuinely helpful. Maybe I'm being overly cautious. The user might have legitimate reasons I'm not seeing.

bigsnarfdude / researchRalph.md
Last active January 22, 2026 16:35
researchRalph.md

https://github.com/bigsnarfdude/researchRalph

  Summary of Validated Findings
  ┌───────────────────────┬─────────────────┬──────────────────────────────────────┐
  │      Experiment       │     Result      │               Learning               │
  ├───────────────────────┼─────────────────┼──────────────────────────────────────┤
  │ jan19_loop            │ 0.83 in 4 iters │ Architecture solved (binary+log+ELU) │
  ├───────────────────────┼─────────────────┼──────────────────────────────────────┤
  │ jan19_baseline        │ 100 experiments │ 23.5x less efficient than Ralph      │
bigsnarfdude / research.md
Created January 19, 2026 04:08
research.md

Alignment Faking Detection: A Two-Month Research Journey

Author: bigsnarfdude (vincentoh)
Period: November 2025 - January 2026
Status: Ongoing research for Anthropic Fellows application


Executive Summary