A three-layer regression-protection stack for AI-built codebases that ship fast across many small fixes. Designed for projects where:
- Releases are frequent (daily alphas, not monthly majors)
- Fixes are small (one-line CLI parser swaps, dependency moves) but many in number
- Multiple agents touch overlapping code in the same release window
- A regression that ships affects every user immediately, not just one customer
The stack catches three distinct regression classes that traditional CI misses, then provides forensic tools to answer "when did this break and what changed?"
A standard CI pipeline runs unit tests + a typecheck + maybe a build. Every regression we shipped on 2026-05-08 passed all three of those for the broken commit. Specifically:
| Regression | What broke | Why CI passed | What user saw |
|---|---|---|---|
| #1867 | `@claude-flow/memory` had `better-sqlite3` as a hard dep + static import | CI ran on Node 20 where prebuilds existed, so the static import evaluated fine | `npm install ruflo@latest` failed on Node 26 with node-gyp errors |
| #1862 | ruflo-core plugin's `hooks.json` called `--format true` (not a real flag) | No CI test invoked the plugin's `hooks.json` against the CLI with realistic stdin | Every Write/Edit tool use printed `[ERROR] Invalid value for --format: true` |
| #1859 | CLI parser preferred stray positionals over named flags (14 sites) | Unit tests passed flags individually, never a `--flag` + boolean-shaped value combo | `post-edit --file X --success true` recorded "true" as the file path |
Three different regression classes, three different reasons unit tests missed them. The root cause is the same: unit tests verify code paths, not user-visible failure modes. When the bug lives in the integration boundary — install resolution, subprocess flag parsing, plugin/CLI version drift — unit tests don't see it.
The agentic validation system adds three layers that each test a user-visible failure mode against a real artifact.
┌─────────────────────────────────────────────────┐
│ Layer 1: Behavioral smoke tests │
│ ───────────────────────────── │
│ Fresh `npm install` on real Node versions │
│ Real subprocess invocation with real JSON │
│ Asserts user-visible signal, not code path │
└─────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────┐
│ Layer 2: Cryptographic witness manifest │
│ ───────────────────────────── │
│ SHA-256 + marker substring per fix │
│ Ed25519-signed with deterministic seed │
│ Anyone can re-derive the public key │
└─────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────┐
│ Layer 3: Append-only temporal history (JSONL) │
│ ───────────────────────────── │
│ One snapshot per regen │
│ Per-fix status timeline │
│ Regression-introduction commit identification │
└─────────────────────────────────────────────────┘
Each layer is independently useful and independently adoptable. Together they form a stack where:
- Layer 1 catches the regression as a user would experience it.
- Layer 2 confirms every documented fix is still in the code, even if Layer 1 has no specific test for it.
- Layer 3 answers when the regression was introduced, so triage doesn't require manual git bisect.
The pattern: build the artifact under test in CI, drive it through the user-visible failure path with a real subprocess, assert on the user-visible signal.
Install smoke (smoke-install-no-bsqlite.mjs) — for npm install failures:
# Pack the package as it would be published
cd v3/@claude-flow/memory && npm pack
# Install into a fresh directory with --omit=optional
# (simulates "native build failed" on platforms without prebuilds)
mkdir -p /tmp/smoke && cd /tmp/smoke && npm init -y && npm install /path/to/tarball --omit=optional
# Assert: package loads, runtime auto-falls-back, round-trip works
# --input-type=module lets the inline script use top-level await
node --input-type=module -e "
const m = await import('@claude-flow/memory');
const db = await m.createDatabase('/tmp/x.db', { provider: 'auto' });
await db.initialize();
// … round-trip a value …
"

This catches any "install fails when the optional native dep can't build" regression, regardless of which dep it is. If a developer adds a new static import of an optional dependency, this fails immediately.
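The fix that this smoke test protects is a move from a static import to a dynamic one with a fallback. A minimal sketch of that loading pattern follows; `loadOptional` and `pickProvider` are hypothetical names for illustration, not functions from `@claude-flow/memory`, and the `'fallback'` label is a placeholder.

```javascript
// Hypothetical helper: try to load an optional dependency at runtime.
// Returns the module namespace, or null when the package is not installed
// or fails to load (for example, a native addon with no prebuild).
async function loadOptional(specifier) {
  try {
    return await import(specifier);
  } catch {
    // Deliberately broad: a missing package and a failed native load
    // both mean "use the fallback" in this sketch.
    return null;
  }
}

// Hypothetical provider selection: prefer the native backend, else fall back.
async function pickProvider() {
  const native = await loadOptional('better-sqlite3');
  return native ? 'better-sqlite3' : 'fallback';
}
```

Because the import happens inside a function, a missing native build no longer breaks module evaluation; it only changes which provider is picked.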
Hook smoke (test-hooks.mjs) — for plugin/CLI version drift:
import { spawnSync } from 'node:child_process';

// Read each PostToolUse hook from hooks.json
const cmd = hooks.PostToolUse.find(h => h.matcher === 'Bash').hooks[0].command;
// Pipe synthetic Claude-Code-style JSON to it
const { status, stdout } = spawnSync('bash', ['-c', cmd], {
  input: JSON.stringify({ tool_input: { command: 'echo hi' } }),
  encoding: 'utf8', // return stdout as a string rather than a Buffer
});
// Assert exit code 0, output contains the actual command not "true"
expect(status).toBe(0);
expect(stdout).toContain('echo hi');
expect(stdout).not.toContain('Recording outcome for: true');

The negative assertion (`not.toContain('Recording outcome for: true')`) is critical. A naive `contains: 'true'` test would pass against the broken code because the recorded value happened to be "true". Catching the wrong-value bug requires asserting the right value.
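To make the point about negative assertions concrete, here is a small sketch that simulates the #1859 bug. The `ctx` object, the output format, and both check functions are hypothetical stand-ins, not the CLI's real internals.

```javascript
// Hypothetical parsed command line for: post-edit --file src/app.ts --success true
// The stray positional 'true' is what the broken parser preferred.
const ctx = { flags: { file: 'src/app.ts', success: 'true' }, args: ['true'] };

// Broken: positional wins over the named flag. Fixed: named flag wins.
const brokenFile = ctx.args[0] || ctx.flags.file;
const fixedFile = ctx.flags.file || ctx.args[0];

// Assumed output shape; only the "Recording outcome for:" prefix comes from the text above.
const brokenOutput = `Recording outcome for: ${brokenFile} (success: ${ctx.flags.success})`;
const fixedOutput = `Recording outcome for: ${fixedFile} (success: ${ctx.flags.success})`;

// Naive check: passes for both outputs, so it cannot see the bug.
const naiveCheck = (out) => out.includes('true');

// Strict check: asserts the right value and rejects the known wrong one.
const strictCheck = (out) =>
  out.includes('src/app.ts') && !out.includes('Recording outcome for: true');
```

In this simulation `naiveCheck` accepts the broken output, while `strictCheck` accepts only the fixed one.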
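smoke-install-no-bsqlite:
  strategy:
    matrix:
      node: ['22', '24']  # versions where prebuilds may be missing
  steps:
    - run: |
        cd v3/@claude-flow/memory
        PKG_DIR="$PWD"
        # npm pack prints the tarball file name; keep an absolute path so it
        # still resolves after changing directory
        TARBALL="$PKG_DIR/$(npm pack | tail -n 1)"
        mkdir -p /tmp/smoke && cd /tmp/smoke
        npm init -y && npm install "$TARBALL" --omit=optional
        # assumes the smoke script lives in the package's scripts/ directory;
        # adjust the path to your layout
        cp "$PKG_DIR/scripts/smoke-no-bsqlite.mjs" ./
        node smoke-no-bsqlite.mjs

A failing job here means: a user running `npm install <pkg>` on a platform without prebuilds will hit this error. No reproduction required from the bug report.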
The pattern: every documented fix gets an entry containing the file path, a SHA-256 of that file at issuance, and a marker substring that must remain in the file while the fix is present. The whole manifest is hashed (SHA-256) and signed (Ed25519) using a deterministic seed derived from the git commit, so the public key can be re-derived without a committed private key.
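The key derivation can be sketched with Node's built-in `crypto` module. This is a swap for illustration: the toolkit itself uses `@noble/ed25519`. The function names are hypothetical, and the choice to sign the manifest hash's hex string as UTF-8 bytes is an assumption, since the exact signed bytes are not specified here.

```javascript
const { createHash, createPrivateKey, createPublicKey, sign, verify } = require('node:crypto');

// Fixed DER prefix that wraps a raw 32-byte Ed25519 seed as a PKCS#8 private key.
const PKCS8_ED25519_PREFIX = Buffer.from('302e020100300506032b657004220420', 'hex');

// seed = sha256(gitCommit + ':ruflo-witness/v1')
function deriveSeed(gitCommit) {
  return createHash('sha256').update(`${gitCommit}:ruflo-witness/v1`).digest();
}

// Build a key pair from the 32-byte seed; the same commit always yields the same keys.
function keyPairFromSeed(seed) {
  const privateKey = createPrivateKey({
    key: Buffer.concat([PKCS8_ED25519_PREFIX, seed]),
    format: 'der',
    type: 'pkcs8',
  });
  const publicKey = createPublicKey(privateKey);
  // The raw 32-byte public key is the tail of the SPKI encoding.
  const publicKeyHex = publicKey
    .export({ format: 'der', type: 'spki' })
    .subarray(-32)
    .toString('hex');
  return { privateKey, publicKey, publicKeyHex };
}

// Assumption: the signature covers the manifestHash hex string as UTF-8 bytes.
function signManifestHash(manifestHashHex, privateKey) {
  return sign(null, Buffer.from(manifestHashHex, 'utf8'), privateKey).toString('hex');
}

function verifyManifestHash(manifestHashHex, signatureHex, publicKey) {
  return verify(
    null,
    Buffer.from(manifestHashHex, 'utf8'),
    publicKey,
    Buffer.from(signatureHex, 'hex'),
  );
}
```

A verifier would call `keyPairFromSeed(deriveSeed(gitCommit))`, compare `publicKeyHex` with the manifest's `publicKey` field, then run `verifyManifestHash` on the stored hash and signature.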
{
"manifest": {
"schema": "ruflo-witness/v1",
"issuedAt": "2026-05-09T01:00:47.879Z",
"gitCommit": "54c706f56...",
"fixes": [
{
"id": "#1867",
"desc": "Node 26 install: better-sqlite3 dynamic import + optionalDependencies",
"file": "v3/@claude-flow/memory/dist/sqlite-backend.js",
"sha256": "<64-char hex>",
"marker": "(await import('better-sqlite3')).default",
"markerVerified": true
}
],
"summary": { "totalFixes": 81, "verified": 81, "missing": 0 }
},
"integrity": {
"manifestHashAlgo": "sha256",
"manifestHash": "<64-char hex of canonical manifest>",
"signatureAlgo": "ed25519",
"publicKey": "<32-byte hex>",
"signature": "<64-byte hex>",
"seedDerivation": "sha256(gitCommit + ':ruflo-witness/v1')"
}
}

# Anyone with the same git commit can re-derive the public key
GITSHA=$(jq -r '.manifest.gitCommit' verification.md.json)
SEED=$(echo -n "$GITSHA:ruflo-witness/v1" | sha256sum | head -c 64)
# Then check Ed25519 signature against manifestHash with that key

For each fix entry, the verifier computes:
- Pass — file's SHA-256 matches manifest entry exactly (no drift since issuance)
- Drift — file SHA-256 changed but the marker is still present (acceptable — codebase advanced)
- Regressed — the marker is missing from the file (real regression — fix has been removed or refactored away)
- Missing — the cited file no longer exists (rebuild needed, or fix retired)
CI gates publish on regressed === 0 && signatureValid.
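The four statuses and the publish gate can be sketched as two small functions. The input shapes and function names are hypothetical, and checking the marker before the hash is an assumption about precedence; `verify.mjs` may order the checks differently.

```javascript
// entry: one manifest fix, { sha256, marker }.
// file: hypothetical description of the file on disk now, { exists, sha256, content }.
function classifyFix(entry, file) {
  if (!file.exists) return 'missing';
  // Assumption: a missing marker counts as regressed even if the hash matches.
  if (!file.content.includes(entry.marker)) return 'regressed';
  return file.sha256 === entry.sha256 ? 'pass' : 'drift';
}

// Publish gate as stated above: no regressed fixes and a valid signature.
function canPublish(statuses, signatureValid) {
  const regressed = statuses.filter((s) => s === 'regressed').length;
  return regressed === 0 && signatureValid;
}
```

Note that `drift` and `missing` do not block the gate in this sketch, matching the stated condition.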
A SHA-256-only check would flag every benign whitespace change as a regression. The marker is the semantic invariant — "the fix is the presence of this specific substring." If a developer refactors the file but preserves the fix, marker stays present, drift is recorded, no false alarm. If a developer deletes the fix, marker disappears, regression is caught.
Choosing markers is the load-bearing skill. Bad markers:
- `'function'`: too generic, false-positives everywhere
- `'TODO'`: likely to flap as TODOs come and go

Good markers:
- `(await import('better-sqlite3')).default`: distinctive and specific to the fix mechanism
- `(ctx.flags.file as string) || ctx.args[0]`: the actual swap that fixed #1859
- `import * as bcrypt from 'bcryptjs'`: proves the migration from `bcrypt` to `bcryptjs` is in dist
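One rough way to sanity-check a candidate marker before registering it is to count how often it occurs in the target file. This heuristic and both function names are illustrative assumptions, not part of the witness toolkit.

```javascript
// Count non-overlapping occurrences of `marker` in `content`.
function countOccurrences(content, marker) {
  if (marker.length === 0) return 0;
  return content.split(marker).length - 1;
}

// Hypothetical rule of thumb: a usable marker appears exactly once in its file.
function assessMarker(content, marker) {
  const n = countOccurrences(content, marker);
  if (n === 0) return 'absent';
  return n === 1 ? 'unique' : 'generic';
}
```

A marker that comes back `generic` can survive deletion of the fix because another occurrence keeps matching, which is the false-positive case described above.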
The pattern: every regen of the witness appends one line to a JSONL file recording the snapshot. The history file is committed alongside the manifest. Queries against the history answer:
- When was a regression introduced (which commit window)
- What fixes have flapped between pass and regressed (likely a brittle marker)
- Which fixes are persistently drifting (probably an unstable file)
# For each currently-regressed fix, find the commit that introduced the regression
node history.mjs --history verification-history.jsonl regressions
# Output:
# F12
# last pass: a1b2c3d4e5f6 2026-05-07T14:23:11.000Z
# regressed at: 9f8e7d6c5b4a 2026-05-08T09:14:55.000Z
# Now triage with git
git log a1b2c3d4..9f8e7d6c -- path/to/file

This collapses regression triage from "git bisect across 50 commits" to "read the diff for the 3 commits in this 18-hour window."
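The query behind the `regressions` output can be approximated in a few lines. This sketch assumes each history line is a JSON object with `commit`, `issuedAt`, and a `fixes` map keyed by fix id whose entries carry a `markerVerified` boolean; the function names are hypothetical and `history.mjs` may work differently.

```javascript
// Parse the text of a JSONL history file into snapshot objects, oldest first.
function parseHistory(jsonl) {
  return jsonl
    .split('\n')
    .filter((line) => line.trim())
    .map((line) => JSON.parse(line));
}

// For one fix id, return the last snapshot where the marker verified and the
// first later snapshot where it did not. Returns null if not currently regressed.
function findRegressionWindow(snapshots, fixId) {
  let lastPass = null;
  let regressedAt = null;
  for (const snap of snapshots) {
    const fix = snap.fixes[fixId];
    if (!fix) continue; // fix not registered yet in this snapshot
    if (fix.markerVerified) {
      lastPass = snap;
      regressedAt = null; // a later pass resets the window
    } else if (!regressedAt) {
      regressedAt = snap;
    }
  }
  if (!regressedAt) return null;
  return {
    lastPass: lastPass && { commit: lastPass.commit, issuedAt: lastPass.issuedAt },
    regressedAt: { commit: regressedAt.commit, issuedAt: regressedAt.issuedAt },
  };
}
```

The two commits in the result are the endpoints to pass to `git log`.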
node history.mjs --history verification-history.jsonl timeline --id F12
# Output:
# 2026-05-06T... abc123 pass
# 2026-05-07T... def456 pass
# 2026-05-08T... 789abc regressed
# 2026-05-09T... 012def regressed

A fix that flaps pass → regressed → pass → regressed is signalling that its marker is too brittle. A fix that's drift-only for 30 snapshots is signalling that its file is undergoing constant refactor and its SHA-256 baseline is meaningless: either accept perpetual drift or update the baseline.
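Both signals can be read mechanically from a status timeline. The sketch below is illustrative: the function names and the thresholds (more than one flip, 30 trailing drifts) are assumptions taken from the prose above, not values from `history.mjs`.

```javascript
// statuses: one fix's timeline, oldest first, e.g. ['pass', 'regressed', 'pass'].
// Counts transitions into or out of 'regressed'; drift counts as not regressed.
function countFlips(statuses) {
  let flips = 0;
  for (let i = 1; i < statuses.length; i++) {
    const wasRegressed = statuses[i - 1] === 'regressed';
    const nowRegressed = statuses[i] === 'regressed';
    if (wasRegressed !== nowRegressed) flips++;
  }
  return flips;
}

// Length of the trailing run of 'drift' snapshots.
function trailingDriftRun(statuses) {
  let run = 0;
  for (let i = statuses.length - 1; i >= 0 && statuses[i] === 'drift'; i--) run++;
  return run;
}

// Hypothetical thresholds: repeated flips suggest a brittle marker,
// a long drift run suggests a stale SHA-256 baseline.
function diagnose(statuses) {
  if (countFlips(statuses) > 1) return 'brittle-marker';
  if (trailingDriftRun(statuses) >= 30) return 'stale-baseline';
  return 'ok';
}
```

A single pass-to-regressed transition is reported as `ok` here because it is an ordinary regression, not flapping.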
# Bootstrap empty manifest + history + fixes template
node plugins/ruflo-core/scripts/witness/init.mjs
# Edit witness-fixes.json to register your fixes:
# { "fixes": [ { "id": "MY-001", "desc": "...", "file": "src/foo.ts", "marker": "..." } ] }
# Install the only runtime dep
npm i @noble/ed25519
# Generate the signed manifest + first history entry
node plugins/ruflo-core/scripts/witness/regen.mjs \
--manifest verification.md.json \
--history verification-history.jsonl \
--fixes witness-fixes.json
# Commit verification.md.json + verification-history.jsonl + witness-fixes.json together
git add verification.md.json verification-history.jsonl witness-fixes.json
git commit -m "feat: bootstrap witness manifest"

- name: Witness verify
  run: |
    node plugins/ruflo-core/scripts/witness/verify.mjs \
      --manifest verification.md.json \
      --json > /tmp/witness.json
    node -e "
    const r = require('/tmp/witness.json');
    if (!r.ok) { console.error('signature or fix regressed'); process.exit(1); }
    "

When you ship a fix:
# 1. Identify a distinctive marker substring that will be present
# while the fix is in the file. Use a unique pattern from the diff,
# not generic words like "function" or "import".
# 2. Append to witness-fixes.json:
{
"id": "#234",
"desc": "Fix race condition in token refresh",
"file": "dist/auth.js",
"marker": "if (this._refreshing) return this._refreshing;"
}
# 3. Dry-run to confirm verified=N/N before writing:
node plugins/ruflo-core/scripts/witness/regen.mjs \
--manifest verification.md.json \
--history verification-history.jsonl \
--fixes witness-fixes.json \
--dry-run
# 4. Real run if dry-run looks good
node plugins/ruflo-core/scripts/witness/regen.mjs \
--manifest verification.md.json \
--history verification-history.jsonl \
--fixes witness-fixes.json
# 5. Commit the trio
git add verification.md.json verification-history.jsonl witness-fixes.json

CI reports F12 regressed. To find when it broke:
node plugins/ruflo-core/scripts/witness/history.mjs \
--history verification-history.jsonl regressions
# Output:
# F12
# last pass: a1b2c3d4 2026-05-07T14:23:11.000Z
# regressed at: 9f8e7d6c 2026-05-08T09:14:55.000Z
# Read the diff for the 3 commits in that window
git log a1b2c3d4..9f8e7d6c -- $(jq -r '.manifest.fixes[] | select(.id == "F12") | .file' verification.md.json)

jobs:
  smoke-install:
    name: Smoke install / Node ${{ matrix.node }}
    strategy:
      matrix: { node: ['22', '24'] }
    steps:
      - run: scripts/test-fresh-install.sh
  plugin-hooks-smoke:
    strategy:
      matrix: { node: ['20', '22'] }
    steps:
      - run: node scripts/test-hooks.mjs "node $PWD/bin/cli.js"
  witness-verify:
    needs: [smoke-install, plugin-hooks-smoke]  # both behavioral layers must pass first
    steps:
      - run: |
          node scripts/witness/verify.mjs --manifest verification.md.json --json > /tmp/r.json
          node -e "if (!require('/tmp/r.json').ok) process.exit(1)"
      - run: |
          node scripts/witness/history.mjs --history verification-history.jsonl summary
          # Soft signal: prints "newly regressed" fixes if any
  publish:
    needs: [smoke-install, plugin-hooks-smoke, witness-verify]
    if: github.ref == 'refs/heads/main'
    steps:
      - run: npm publish

The publish step gates on all three layers green. Behavioral smoke catches user-experience regressions. Witness catches presence regressions. History surfaces the introduction commit. Together they provide both prevention and forensics.
These are the specific traps I hit wiring this into ruflo's GitHub Actions and that adopters will hit too. The fixes are small once you know to look for them; the failure modes are subtle when you don't.
verify.mjs loads @noble/ed25519 via createRequire. With pnpm's default isolated node-linker, transitive deps don't hoist to the workspace root unless a workspace member declares them directly. Locally you might have a flat copy at <root>/node_modules from an earlier npm install and never notice. In CI, a fresh pnpm-only install has no such copy, so the probe fails silently and the result is signatureValid: false.
Fix: the probes array in verify.mjs and lib.mjs should include the workspace packages that do declare @noble/ed25519 directly:
const probes = [
repoRoot,
join(repoRoot, 'v3'),
join(repoRoot, 'v3/@claude-flow/cli'), // declares ed25519
join(repoRoot, 'v3/@claude-flow/plugin-agent-federation'), // declares ed25519
];

Adapt the inner package paths to your repo layout. The shipped script is pre-configured for ruflo's monorepo; in other projects, edit the array to match wherever @noble/ed25519 is a direct dep.
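A probe loop of this kind can be sketched as follows. It is an illustrative reconstruction, not the code in `verify.mjs`; `resolveFromProbes` and `mustResolve` are hypothetical names, and probe directories are assumed to be absolute paths.

```javascript
const { createRequire } = require('node:module');
const { join } = require('node:path');

// Try to resolve `specifier` from each probe directory in order.
// Returns { dir, resolved } for the first hit, or null if none resolve.
function resolveFromProbes(specifier, probes) {
  for (const dir of probes) {
    try {
      // createRequire needs an absolute file path; the file need not exist.
      const req = createRequire(join(dir, 'noop.js'));
      return { dir, resolved: req.resolve(specifier) };
    } catch {
      // Not resolvable from this directory; try the next probe.
    }
  }
  return null;
}

// Fail loudly with the probed paths instead of degrading to signatureValid: false.
function mustResolve(specifier, probes) {
  const hit = resolveFromProbes(specifier, probes);
  if (!hit) {
    throw new Error(`${specifier} not resolvable from: ${probes.join(', ')}`);
  }
  return hit;
}
```

Throwing with the list of probed directories turns the silent failure described above into a message that names exactly where the dependency was expected.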
There are two ways to invoke the verifier: the bundled CLI subcommand (ruflo verify) and the standalone plugin script (plugins/ruflo-core/scripts/witness/verify.mjs). They produce identical output.
Use the standalone in CI. The CLI binary may transitively load native modules (e.g. sharp for image processing, onnxruntime-node for embeddings). pnpm v10 and later don't run dependencies' postinstall scripts by default, so the prebuilds aren't fetched and the CLI fails on first import, long before reaching the verify code. The standalone has zero deps beyond @noble/ed25519.
# ✗ Don't do this in CI — pulls in CLI's native deps
- run: node bin/cli.js verify --manifest verification.md.json
# ✓ Do this — pure-JS, only @noble/ed25519
- run: node plugins/ruflo-core/scripts/witness/verify.mjs --manifest verification.md.json --json

If the smoke job packs a workspace package (e.g. `npm pack` the memory package, then install the tarball with `--omit=optional` to simulate a Node version without prebuilds), npm rejects `workspace:*` protocol entries with EUNSUPPORTEDPROTOCOL.
Fix: use pnpm pack instead — it rewrites workspace:* to resolved versions before tarballing, producing a tarball that plain npm install can consume.
- name: Install workspace + build memory
  working-directory: v3
  run: |
    pnpm install --frozen-lockfile
    pnpm --filter @claude-flow/memory... run build
- name: Pack memory tarball (pnpm rewrites workspace:* → versions)
  id: pack
  working-directory: v3/@claude-flow/memory
  run: |
    TARBALL=$(pnpm pack --pack-destination /tmp 2>&1 | grep -E "\.tgz$" | head -1)
    echo "tarball=$TARBALL" >> "$GITHUB_OUTPUT"

`set -e` (the GitHub Actions default for `run:` blocks) kills the bash script the instant verify.mjs returns non-zero, before any diagnostic node block runs. Result: a 65ms job failure with no log output, and you have no idea which fix regressed or whether the signature even loaded.
Fix: always wrap the verify call in set +e ... set -e, capture both streams, and analyze unconditionally:
- name: Verify witness manifest
  run: |
    set +e
    node plugins/ruflo-core/scripts/witness/verify.mjs \
      --manifest verification.md.json \
      --json > /tmp/witness-result.json 2> /tmp/witness-result.err
    VERIFY_EXIT=$?
    set -e
    echo "--- verify.mjs exit code: $VERIFY_EXIT ---"
    echo "--- stderr ---"
    cat /tmp/witness-result.err || true
    echo "--- summary ---"
    node -e "
    const fs = require('fs');
    const raw = fs.readFileSync('/tmp/witness-result.json', 'utf8');
    if (!raw.trim()) { console.error('verify.mjs produced no JSON output'); process.exit(1); }
    const r = JSON.parse(raw);
    console.log(JSON.stringify({signature: r.signature, summary: r.summary}, null, 2));
    const failures = (r.results || []).filter(x => x.status !== 'pass' && x.status !== 'drift');
    if (failures.length) {
      console.error('non-pass fixes:');
      for (const f of failures) console.error(' ' + f.status + ': ' + f.id + ' (' + f.file + ')');
    }
    if (!r.ok) { console.error('witness verify FAILED'); process.exit(1); }
    if (r.summary.regressed > 0) { console.error('regressed fixes:', r.summary.regressed); process.exit(1); }
    console.log('witness verify ok:', r.summary.pass, 'pass,', r.summary.drift, 'drift');
    "

This costs nothing on the green path and gives you a concrete failure cause on the red path. The pattern generalizes: any CI step that gates on a signed/cryptographic check should surface why the check failed, not just that it failed.
| Failure class | Layer that catches it | Example |
|---|---|---|
| Install fails on platform without prebuilds | 1 (install smoke) | npm install errors out during native build |
| Wrong CLI flag handling, parser ambiguity | 1 (subprocess smoke) | --flag value records the wrong value |
| Plugin calls flag the CLI doesn't have | 1 (subprocess smoke) | Hook prints Invalid value for --format: true |
| Documented fix silently removed | 2 (witness markers) | Refactor deletes the load-bearing line, code still compiles |
| Fix regressed: which commit? | 3 (history) | git bisect reduced to 3 commits in 18-hour window |
| Marker too brittle, flaps pass↔regressed | 3 (history) | Status timeline shows oscillation |
- No CLI required for adopters. The standalone scripts depend only on `@noble/ed25519` (~15KB minified). Copy `plugins/ruflo-core/scripts/witness/` into your project, install one package, run.
- JSONL is committed, not gitignored. Without committed history, you lose Layer 3 entirely.
- Markers are the load-bearing skill. Generic markers false-positive; brittle markers flap. Aim for unique patterns specific to the fix mechanism.
- Layers 1 and 2 complement each other. Behavioral smoke catches things you wrote a test for. Witness catches things you didn't. Don't pick one.
| Path | Role |
|---|---|
| `plugins/ruflo-core/scripts/witness/lib.mjs` | Shared regen + history primitives |
| `plugins/ruflo-core/scripts/witness/init.mjs` | Bootstrap into a fresh project |
| `plugins/ruflo-core/scripts/witness/regen.mjs` | Sign manifest + append history |
| `plugins/ruflo-core/scripts/witness/verify.mjs` | Validate signature + markers |
| `plugins/ruflo-core/scripts/witness/history.mjs` | Query temporal log |
| `plugins/ruflo-core/skills/witness/SKILL.md` | Workflow + anti-patterns |
| `plugins/ruflo-core/agents/witness-curator.md` | Agent for adding fixes / interpreting regressions |
| `verification.md.json` | The signed manifest itself |
| `verification-history.jsonl` | The append-only temporal log |
| `witness-fixes.json` | Project-specific fix list (input to regen) |
- `v3/docs/adr/ADR-102-plugin-hook-cli-flag-regression-ci-guard.md`: smoke harness pattern + flag-priority CLI convention
- `v3/docs/adr/ADR-103-witness-temporal-history.md`: JSONL history layer + plugin-distributed toolkit
- ruvnet/ruflo `verification.md`: original witness manifest documentation
- `~/.claude/.../project_verification_process.md`: pre-toolkit inline regen process (superseded)
{ "v": 1, "commit": "54c706f56138...", "issuedAt": "2026-05-09T01:00:47.879Z", "branch": "main", "manifestHash": "<64-char hex>", "summary": { "totalFixes": 81, "verified": 81, "missing": 0 }, "fixes": { "#1867": { "sha256": "...", "markerVerified": true }, "F1": { "sha256": "...", "markerVerified": true }, /* ... one entry per fix, keyed by id ... */ } }