Require real behavior proof for external PRs (#77622)

* ci: require real behavior proof for external PRs
* fix: tighten real behavior proof heuristics
* fix: reject test-only real behavior proof labels

---------

Co-authored-by: Peter Steinberger <steipete@gmail.com>
2026-05-06 04:40:43 +00:00 · 2026-05-04 21:45:30 -07:00
parent d02fbc6116
commit 70f34bf177
10 changed files with 671 additions and 11 deletions
--- a/.github/pull_request_template.md
+++ b/.github/pull_request_template.md
@@ -35,6 +35,18 @@ If this PR fixes a plugin beta-release blocker, title it `fix(<plugin-id>): beta
 - Related #
 - [ ] This PR fixes a bug or regression

+## Real behavior proof (required for external PRs)
+
+External contributors must show after-fix evidence from a real OpenClaw setup. Unit tests, mocks, lint, typechecks, snapshots, and CI are supplemental only. Screenshots are encouraged even for CLI, console, text, or log changes; terminal screenshots and copied live output count.
+
+- Behavior or issue addressed:
+- Real environment tested:
+- Exact steps or command run after this patch:
+- Evidence after fix (screenshot, recording, terminal capture, console output, redacted runtime log, linked artifact, or copied live output):
+- Observed result after fix:
+- What was not tested:
+- Before evidence (optional but encouraged):
+
 ## Root Cause (if applicable)

 For bug fixes or regressions, explain why this happened, not just what changed. Otherwise write `N/A`. If the cause is unclear, write `Unknown`.
--- a/.github/workflows/auto-response.yml
+++ b/.github/workflows/auto-response.yml
@@ -6,7 +6,7 @@ on:
  issue_comment:
    types: [created]
  pull_request_target: # zizmor: ignore[dangerous-triggers] maintainer-owned label automation; trusted base checkout only, no untrusted PR code execution
-    types: [opened, edited, synchronize, reopened, labeled]
+    types: [opened, edited, synchronize, reopened, labeled, unlabeled]

 env:
  FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: "true"
--- a/.github/workflows/real-behavior-proof.yml
+++ b/.github/workflows/real-behavior-proof.yml
@@ -0,0 +1,29 @@
+name: Real behavior proof
+
+on:
+  pull_request_target: # zizmor: ignore[dangerous-triggers] trusted base checkout only; no untrusted PR code execution
+    types: [opened, edited, synchronize, reopened, ready_for_review, labeled, unlabeled]
+
+env:
+  FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: "true"
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref || github.run_id }}
+  cancel-in-progress: true
+
+permissions: {}
+
+jobs:
+  real-behavior-proof:
+    name: Real behavior proof
+    permissions:
+      contents: read
+      pull-requests: read
+    runs-on: ubuntu-24.04
+    steps:
+      - uses: actions/checkout@v6
+        with:
+          ref: ${{ github.event.pull_request.base.sha }}
+          persist-credentials: false
+      - name: Check real behavior proof
+        run: node scripts/github/real-behavior-proof-check.mjs
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -11,6 +11,7 @@ Docs: https://docs.openclaw.ai
 ### Changes

 - Gateway/Windows: bind the default loopback gateway listener only to `127.0.0.1` on Windows so libuv's dual-stack `::1` behavior cannot wedge localhost HTTP requests. (#69701, fixes #69674) Thanks @SARAMALI15792.
+- Contributor PRs: require external pull requests to include after-fix real behavior proof from a real OpenClaw setup, with terminal screenshots, console output, redacted runtime logs, linked artifacts, and copied live output treated as valid evidence while unit tests, mocks, lint, typechecks, snapshots, and CI remain supplemental only.
 - Plugins/migration: emit catalog-backed install hints when `plugins.entries` or `plugins.allow` references an official external plugin that is not installed, so upgraded configs point operators to `openclaw plugins install <spec>` instead of telling them to remove valid plugin config. (#77483) Thanks @hclsys.
 - OpenAI/Codex media: advertise Codex audio transcription in runtime and manifest metadata and route active Codex chat models to the OpenAI transcription default instead of sending chat model ids to audio transcription. Thanks @vincentkoc.
 - Dependencies: refresh runtime and provider packages including Pi 0.73.0, ACPX adapters, OpenAI, Anthropic, Slack, and TypeScript native preview, while keeping the Bedrock runtime installer override pinned below the Windows ARM Node 24 npm resolver failure.
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -100,6 +100,7 @@ For coordinated change sets that genuinely need more than 20 PRs, join the **#cl
 ## Before You PR

 - Test locally with your OpenClaw instance
+- External PRs must include a filled **Real behavior proof** section in the PR body. Show the real setup you tested, the exact command or steps you ran after the patch, after-fix evidence, the observed result, and anything you did not test. Screenshots, recordings, terminal screenshots, console output, copied live output, linked artifacts, and redacted runtime logs all count. Unit tests, mocks, snapshots, lint, typechecks, and CI are useful but do not satisfy this requirement by themselves. Maintainers may apply `proof: override` only when the proof gate should not apply.
 - Run tests: `pnpm build && pnpm check && pnpm test`
 - For iterative local commits, `scripts/committer --fast "message" <files...>` passes `FAST_COMMIT=1` through to the pre-commit hook so it skips the repo-wide `pnpm check`. Only use it when you've already run equivalent targeted validation for the touched surface.
 - For extension/plugin changes, run the fast local lane first:
@@ -160,7 +161,7 @@ Built with Codex, Claude, or other AI tools? **Awesome - just mark it!**
 Please include in your PR:

 - [ ] Mark as AI-assisted in the PR title or description
-- [ ] Note the degree of testing (untested / lightly tested / fully tested)
+- [ ] Include human-run real behavior proof from your own setup. AI-generated tests, mocks, lint, typechecks, and CI output are supplemental only; they do not prove the fix works for users.
 - [ ] Include prompts or session logs if possible (super helpful!)
 - [ ] Confirm you understand what the code does
 - [ ] If you have access to Codex, run `codex review --base origin/main` locally and address the findings before asking for review
--- a/scripts/github/barnacle-auto-response.mjs
+++ b/scripts/github/barnacle-auto-response.mjs
@@ -1,5 +1,13 @@
 // Barnacle owns deterministic GitHub triage and auto-response behavior.

+import {
+  MOCK_ONLY_PROOF_LABEL,
+  NEEDS_REAL_BEHAVIOR_PROOF_LABEL,
+  PROOF_OVERRIDE_LABEL,
+  evaluateRealBehaviorProof,
+  labelsForRealBehaviorProof,
+} from "./real-behavior-proof-policy.mjs";
+
 const activePrLimit = 20;

 const thirdPartyExtensionMessage =
@@ -134,6 +142,18 @@ export const managedLabelSpecs = {
    color: "C5DEF5",
    description: "Candidate: PR template appears mostly untouched.",
  },
+  [NEEDS_REAL_BEHAVIOR_PROOF_LABEL]: {
+    color: "C5DEF5",
+    description: "Candidate: external PR needs after-fix proof from a real setup.",
+  },
+  [MOCK_ONLY_PROOF_LABEL]: {
+    color: "C5DEF5",
+    description: "Candidate: PR proof only shows tests, mocks, snapshots, lint, typecheck, or CI.",
+  },
+  [PROOF_OVERRIDE_LABEL]: {
+    color: "C2E0C6",
+    description: "Maintainer override for the external PR real behavior proof gate.",
+  },
  "triage: dirty-candidate": {
    color: "C5DEF5",
    description: "Candidate: broad unrelated surfaces; may need splitting or cleanup.",
@@ -154,6 +174,8 @@ export const candidateLabels = {
  docsDiscoverability: "triage: docs-discoverability",
  testOnlyNoBug: "triage: test-only-no-bug",
  refactorOnly: "triage: refactor-only",
+  needsRealBehaviorProof: NEEDS_REAL_BEHAVIOR_PROOF_LABEL,
+  mockOnlyProof: MOCK_ONLY_PROOF_LABEL,
  dirtyCandidate: "triage: dirty-candidate",
  riskyInfra: "triage: risky-infra",
  externalPluginCandidate: "triage: external-plugin-candidate",
@@ -196,10 +218,23 @@ const maintainerAuthorLabel = "maintainer";
 const privilegedAuthorAssociations = new Set(["OWNER", "MEMBER", "COLLABORATOR"]);
 const privilegedRepositoryRoles = new Set(["admin", "maintain", "write"]);
 const candidateLabelValues = Object.values(candidateLabels);
+const proofCandidateLabelValues = [NEEDS_REAL_BEHAVIOR_PROOF_LABEL, MOCK_ONLY_PROOF_LABEL];
 const noisyPrMessage =
  "Closing this PR because it looks dirty (too many unrelated or unexpected changes). This usually happens when a branch picks up unrelated commits or a merge went sideways. Please recreate the PR from a clean branch.";

 const candidateActionRules = [
+  {
+    label: candidateLabels.needsRealBehaviorProof,
+    close: true,
+    message:
+      "Closing this PR because it does not include real behavior proof. Please reopen or resubmit with after-fix evidence from a real OpenClaw setup; terminal screenshots, console output, redacted logs, recordings, linked artifacts, and copied live output count. Unit tests, mocks, snapshots, lint, typechecks, and CI are supplemental only.",
+  },
+  {
+    label: candidateLabels.mockOnlyProof,
+    close: true,
+    message:
+      "Closing this PR because the proof only shows tests, mocks, snapshots, lint, typechecks, or CI. Please reopen or resubmit with after-fix evidence from a real OpenClaw setup; terminal screenshots, console output, redacted logs, recordings, linked artifacts, and copied live output count.",
+  },
  {
    label: candidateLabels.dirtyCandidate,
    close: true,
@@ -438,6 +473,14 @@ export function classifyPullRequestCandidateLabels(pullRequest, files) {
    labelsToAdd.push(candidateLabels.blankTemplate);
  }

+  labelsToAdd.push(
+    ...labelsForRealBehaviorProof(
+      evaluateRealBehaviorProof({
+        pullRequest,
+      }),
+    ),
+  );
+
  const docsOnly = filenames.every(isMarkdownOrDocsFile);
  const docsSignal =
    /\b(add|adds|update|updates|fix|fixes|improve|cleanup|clean up|typo|readme|docs?|documentation|translation|translate)\b/i.test(
@@ -718,14 +761,18 @@ async function addMissingLabels(github, context, core, issueNumber, labels, labe

 async function applyPullRequestCandidateLabels(github, context, core, pullRequest, labelSet) {
  const files = await listPullRequestFiles(github, context, pullRequest);
-  await addMissingLabels(
-    github,
-    context,
-    core,
-    pullRequest.number,
-    classifyPullRequestCandidateLabels(pullRequest, files),
-    labelSet,
+  const classifiedLabels = classifyPullRequestCandidateLabels(
+    {
+      ...pullRequest,
+      labels: [...labelSet].map((name) => ({ name })),
+    },
+    files,
  );
+  const staleProofLabels = proofCandidateLabelValues.filter(
+    (label) => labelSet.has(label) && !classifiedLabels.includes(label),
+  );
+  await removeLabels(github, context, pullRequest.number, staleProofLabels, labelSet);
+  await addMissingLabels(github, context, core, pullRequest.number, classifiedLabels, labelSet);
 }

 function isAutomationUser(user, fallbackLogin = "") {
@@ -931,7 +978,9 @@ export async function runBarnacleAutoResponse({ github, context, core = console
  const isLabelEvent = context.payload.action === "labeled";
  const isPrCandidateEvent =
    pullRequest &&
-    ["opened", "edited", "synchronize", "reopened", "labeled"].includes(context.payload.action);
+    ["opened", "edited", "synchronize", "reopened", "labeled", "unlabeled"].includes(
+      context.payload.action,
+    );
  if (!hasTriggerLabel && !isLabelEvent && !isPrCandidateEvent) {
    return;
  }
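
Note on the label reconciliation in `applyPullRequestCandidateLabels` above: the change removes proof labels that are on the PR but are no longer produced by classification, then adds whatever is still missing. A minimal standalone sketch of that set arithmetic follows. `reconcileProofLabels` and `PROOF_LABELS` are hypothetical names for illustration, not exports of `barnacle-auto-response.mjs`, and the sketch makes no GitHub API calls.

```javascript
// Illustrative sketch only: reconcileProofLabels is a hypothetical helper,
// not a function from barnacle-auto-response.mjs.
const PROOF_LABELS = ["triage: needs-real-behavior-proof", "triage: mock-only-proof"];

function reconcileProofLabels(currentLabels, classifiedLabels) {
  const current = new Set(currentLabels);
  const classified = new Set(classifiedLabels);
  // Proof labels that are on the PR but no longer produced by classification.
  const toRemove = PROOF_LABELS.filter((label) => current.has(label) && !classified.has(label));
  // Classified labels that the PR does not carry yet.
  const toAdd = [...classified].filter((label) => !current.has(label));
  return { toRemove, toAdd };
}

// A maintainer applied "proof: override", so classification yields no proof labels
// and the earlier "needs proof" label becomes stale.
const plan = reconcileProofLabels(
  ["triage: needs-real-behavior-proof", "proof: override"],
  [],
);
console.log(plan);
```

Only the two proof labels are ever treated as stale here, which mirrors the `proofCandidateLabelValues` filter in the diff; other candidate labels are left alone.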
--- a/scripts/github/real-behavior-proof-check.mjs
+++ b/scripts/github/real-behavior-proof-check.mjs
@@ -0,0 +1,34 @@
+#!/usr/bin/env node
+import { readFileSync } from "node:fs";
+import { evaluateRealBehaviorProof } from "./real-behavior-proof-policy.mjs";
+
+function escapeCommandValue(value) {
+  return String(value)
+    .replace(/%/g, "%25")
+    .replace(/\r/g, "%0D")
+    .replace(/\n/g, "%0A")
+    .replace(/:/g, "%3A");
+}
+
+const eventPath = process.env.GITHUB_EVENT_PATH;
+if (!eventPath) {
+  console.error("::error title=Real behavior proof failed::GITHUB_EVENT_PATH is not set.");
+  process.exit(1);
+}
+
+const event = JSON.parse(readFileSync(eventPath, "utf8"));
+const pullRequest = event.pull_request;
+if (!pullRequest) {
+  console.log("No pull_request payload found; skipping real behavior proof gate.");
+  process.exit(0);
+}
+
+const evaluation = evaluateRealBehaviorProof({ pullRequest });
+if (evaluation.passed) {
+  console.log(evaluation.reason);
+  process.exit(0);
+}
+
+const message = `${evaluation.reason} Add after-fix evidence from a real OpenClaw setup in the PR body. Screenshots, recordings, terminal screenshots, console output, redacted runtime logs, linked artifacts, or copied live output count. Unit tests, mocks, snapshots, lint, typechecks, and CI are supplemental only. A maintainer can apply proof: override when appropriate.`;
+console.error(`::error title=Real behavior proof required::${escapeCommandValue(message)}`);
+process.exit(1);
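
Note on the annotation output in `real-behavior-proof-check.mjs` above: the failure message is percent-encoded before it is written as an `::error` workflow command, with `%` replaced first so the later replacements are not double-encoded. The sketch below copies `escapeCommandValue` from the diff and wraps it in `formatErrorCommand`, which is an illustrative name and not part of the script.

```javascript
// Copied from the diff: percent-encodes "%", CR, LF, and ":" in that order.
function escapeCommandValue(value) {
  return String(value)
    .replace(/%/g, "%25")
    .replace(/\r/g, "%0D")
    .replace(/\n/g, "%0A")
    .replace(/:/g, "%3A");
}

// Hypothetical wrapper for illustration; the script builds this string inline.
function formatErrorCommand(title, message) {
  return `::error title=${title}::${escapeCommandValue(message)}`;
}

const line = formatErrorCommand("Real behavior proof required", "50% done\nproof: override");
console.log(line);
// prints: ::error title=Real behavior proof required::50%25 done%0Aproof%3A override
```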
--- a/scripts/github/real-behavior-proof-policy.mjs
+++ b/scripts/github/real-behavior-proof-policy.mjs
@@ -0,0 +1,284 @@
+export const PROOF_OVERRIDE_LABEL = "proof: override";
+export const NEEDS_REAL_BEHAVIOR_PROOF_LABEL = "triage: needs-real-behavior-proof";
+export const MOCK_ONLY_PROOF_LABEL = "triage: mock-only-proof";
+
+const privilegedAuthorAssociations = new Set(["OWNER", "MEMBER", "COLLABORATOR"]);
+
+const requiredProofFields = [
+  {
+    key: "behavior",
+    names: ["Behavior or issue addressed", "Issue addressed", "Behavior addressed"],
+  },
+  {
+    key: "environment",
+    names: ["Real environment tested", "Environment tested", "Real setup tested"],
+  },
+  {
+    key: "steps",
+    names: [
+      "Exact steps or command run after this patch",
+      "Exact steps or command run after the patch",
+      "Exact steps or command run after fix",
+      "Steps run after the patch",
+      "Command run after the patch",
+    ],
+  },
+  {
+    key: "evidence",
+    names: [
+      "Evidence after fix",
+      "After-fix evidence",
+      "Evidence link or embedded proof",
+      "Evidence",
+    ],
+  },
+  {
+    key: "observedResult",
+    names: ["Observed result after fix", "Observed result after the fix", "Observed result"],
+  },
+  {
+    key: "notTested",
+    names: ["What was not tested", "Not tested"],
+    allowNone: true,
+  },
+];
+
+const allProofFieldNames = requiredProofFields
+  .flatMap((field) => field.names)
+  .concat(["Before evidence", "Before evidence optional"]);
+
+const missingValueRegex =
+  /^(?:n\/?a|not applicable|tbd|todo|unknown|unsure|none provided|no evidence|not tested|untested|-|\[[^\]]*\])$/i;
+
+const standaloneMissingProofRegex =
+  /^\s*(?:[-*]\s*)?(?:n\/?a|not applicable|not tested|untested|no evidence|did not test|didn't test|could not test|couldn't test)\s*\.?\s*$/im;
+
+const mockOnlyEvidenceRegex =
+  /\b(?:pnpm|npm|yarn|bun)\s+(?:run\s+)?(?:test|vitest|lint|typecheck|tsgo|build|check)\b|\b(?:vitest|unit tests?|mock(?:ed|s)?|snapshots?|lint|typechecks?|tsgo|ci(?:\s+passes?)?)\b/i;
+
+const artifactEvidenceRegex =
+  /!\[[^\]]*\]\([^)]+\)|github\.com\/user-attachments\/assets\/|github\.com\/[^/\s]+\/[^/\s]+\/actions\/runs\/\d+\/artifacts\/\d+|https?:\/\/\S+\.(?:png|jpe?g|gif|webp|mp4|mov|webm)\b/i;
+
+const evidenceDescriptorRegex =
+  /\b(?:screenshot|screen\s*recording|recording|terminal\s+(?:capture|screenshot|transcript|output)|console\s+(?:output|log)|runtime\s+logs?|redacted\s+logs?|live\s+output|actual\s+output|observed\s+output|stdout|stderr|stack trace|trace excerpt|log excerpt|linked\s+artifacts?|artifact\s+links?)\b|```[\s\S]*\n[\s\S]*\n```/i;
+
+const liveCommandRegex =
+  /\b(?:openclaw|node|docker|curl|gh|ssh|adb|xcrun|xcodebuild|open|npm\s+run|pnpm\s+openclaw)\b/i;
+
+const mockOnlyEvidenceStripRegex =
+  /\b(?:pnpm|npm|yarn|bun)\s+(?:run\s+)?(?:test|vitest|lint|typecheck|tsgo|build|check)\b|\b(?:vitest|unit tests?|mock(?:ed|s)?|snapshots?|lint|typechecks?|tsgo|ci(?:\s+passes?)?|tests?|passed|passes|green|success|succeeded|with|and|the|branch|only|output|transcript|capture|fenced)\b/gi;
+
+const evidenceDescriptorStripRegex =
+  /\b(?:screenshot|screen\s*recording|recording|terminal\s+(?:capture|screenshot|transcript|output)|console\s+(?:output|log)|runtime\s+logs?|redacted\s+logs?|live\s+output|actual\s+output|observed\s+output|stdout|stderr|stack trace|trace excerpt|log excerpt|linked\s+artifacts?|artifact\s+links?)\b/gi;
+
+function escapeRegex(text) {
+  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
+}
+
+function labelNames(labels) {
+  return new Set(
+    (labels ?? [])
+      .map((label) => (typeof label === "string" ? label : label?.name))
+      .filter((label) => typeof label === "string"),
+  );
+}
+
+function isAutomationUser(user = {}, fallbackLogin = "") {
+  const login = user?.login ?? fallbackLogin;
+  return user?.type === "Bot" || /\[bot\]$/i.test(login) || login.startsWith("app/");
+}
+
+export function isExternalPullRequest(pullRequest) {
+  if (!pullRequest) {
+    return false;
+  }
+  if (isAutomationUser(pullRequest.user)) {
+    return false;
+  }
+  const authorAssociation = String(
+    pullRequest.author_association ?? pullRequest.authorAssociation ?? "",
+  ).toUpperCase();
+  return !privilegedAuthorAssociations.has(authorAssociation);
+}
+
+export function hasProofOverride(labels) {
+  return labelNames(labels).has(PROOF_OVERRIDE_LABEL);
+}
+
+export function extractRealBehaviorProofSection(body = "") {
+  const headingRegex = /^#{2,6}\s+real behavior proof\b[^\n]*$/gim;
+  const match = headingRegex.exec(body);
+  if (!match) {
+    return "";
+  }
+  const sectionStart = match.index + match[0].length;
+  const rest = body.slice(sectionStart);
+  const nextHeading = rest.match(/\n#{1,6}\s+\S/);
+  return (nextHeading ? rest.slice(0, nextHeading.index) : rest).trim();
+}
+
+function fieldLineRegex(name) {
+  return new RegExp(
+    `^\\s*(?:[-*]\\s*)?(?:\\*\\*)?${escapeRegex(name)}(?:\\s*\\([^)]*\\))?(?:\\*\\*)?\\s*:\\s*(.*)$`,
+    "i",
+  );
+}
+
+function isAnyProofFieldLine(line) {
+  return allProofFieldNames.some((name) => fieldLineRegex(name).test(line));
+}
+
+function extractFieldValue(section, field) {
+  const lines = section.split("\n");
+  for (let index = 0; index < lines.length; index += 1) {
+    const matchingName = field.names.find((name) => fieldLineRegex(name).test(lines[index]));
+    if (!matchingName) {
+      continue;
+    }
+
+    const match = lines[index].match(fieldLineRegex(matchingName));
+    const valueLines = [match?.[1] ?? ""];
+    for (let next = index + 1; next < lines.length; next += 1) {
+      const line = lines[next];
+      if (/^#{1,6}\s+\S/.test(line) || isAnyProofFieldLine(line)) {
+        break;
+      }
+      valueLines.push(line);
+    }
+    return valueLines.join("\n").trim();
+  }
+  return "";
+}
+
+function stripProofFieldLabels(section) {
+  return section
+    .split("\n")
+    .map((line) => {
+      if (!isAnyProofFieldLine(line)) {
+        return line;
+      }
+      const matchingName = allProofFieldNames.find((name) => fieldLineRegex(name).test(line));
+      const match = matchingName ? line.match(fieldLineRegex(matchingName)) : null;
+      return match?.[1] ?? "";
+    })
+    .join("\n");
+}
+
+function isMissingValue(value, field) {
+  const trimmed = value.trim();
+  if (!trimmed) {
+    return true;
+  }
+  if (
+    field.allowNone &&
+    /^(?:none|nothing else|no known gaps|no additional gaps)$/i.test(trimmed)
+  ) {
+    return false;
+  }
+  return missingValueRegex.test(trimmed);
+}
+
+function hasNonMockEvidencePayload(value) {
+  const payload = value
+    .replace(evidenceDescriptorStripRegex, "")
+    .replace(mockOnlyEvidenceStripRegex, "")
+    .replace(/```(?:\w+)?|```/g, "")
+    .replace(/[`$>:\-_.()[\]\s]+/g, "");
+  return Boolean(payload);
+}
+
+function result(status, reason, details = {}) {
+  return {
+    status,
+    reason,
+    applies: ["passed", "missing", "mock_only", "insufficient", "override"].includes(status),
+    passed: ["passed", "skipped", "override"].includes(status),
+    ...details,
+  };
+}
+
+export function evaluateRealBehaviorProof({ pullRequest, labels } = {}) {
+  const currentLabels = labels ?? pullRequest?.labels ?? [];
+  if (hasProofOverride(currentLabels)) {
+    return result("override", `Maintainer override label ${PROOF_OVERRIDE_LABEL} is present.`);
+  }
+  if (!isExternalPullRequest(pullRequest)) {
+    return result("skipped", "Maintainer, collaborator, or bot PRs do not require this gate.");
+  }
+
+  const section = extractRealBehaviorProofSection(pullRequest?.body ?? "");
+  if (!section) {
+    return result(
+      "missing",
+      "External PRs must include a Real behavior proof section with after-fix evidence from a real setup.",
+    );
+  }
+
+  const fields = Object.fromEntries(
+    requiredProofFields.map((field) => [field.key, extractFieldValue(section, field)]),
+  );
+  const missingFields = requiredProofFields
+    .filter((field) => isMissingValue(fields[field.key] ?? "", field))
+    .map((field) => field.key);
+  if (missingFields.length > 0) {
+    return result(
+      "missing",
+      `Real behavior proof is missing required field content: ${missingFields.join(", ")}.`,
+      { fields, missingFields },
+    );
+  }
+
+  const proofContent = stripProofFieldLabels(section);
+  if (standaloneMissingProofRegex.test(proofContent)) {
+    return result("insufficient", "Real behavior proof says the changed behavior was not tested.", {
+      fields,
+    });
+  }
+
+  const evidenceContent = [fields.evidence, fields.observedResult].join("\n");
+  const proofContentForMockDetection = [fields.evidence, fields.observedResult, fields.steps].join(
+    "\n",
+  );
+  const hasArtifactEvidence = artifactEvidenceRegex.test(evidenceContent);
+  const hasNonMockPayload = hasNonMockEvidencePayload(evidenceContent);
+  const hasMockEvidenceSignal = mockOnlyEvidenceRegex.test(proofContentForMockDetection);
+  if (hasMockEvidenceSignal && !hasArtifactEvidence && !hasNonMockPayload) {
+    return result(
+      "mock_only",
+      "Unit tests, mocks, snapshots, lint, typechecks, and CI are supplemental and do not count as real behavior proof.",
+      { fields },
+    );
+  }
+
+  const hasRealEvidence =
+    hasArtifactEvidence ||
+    (evidenceDescriptorRegex.test(evidenceContent) && hasNonMockPayload) ||
+    liveCommandRegex.test(evidenceContent);
+  if (hasMockEvidenceSignal && !hasRealEvidence) {
+    return result(
+      "mock_only",
+      "Unit tests, mocks, snapshots, lint, typechecks, and CI are supplemental and do not count as real behavior proof.",
+      { fields },
+    );
+  }
+
+  if (!hasRealEvidence) {
+    return result(
+      "insufficient",
+      "Real behavior proof must include an after-fix screenshot, recording, terminal capture, console output, redacted runtime log, linked artifact, or copied live output.",
+      { fields },
+    );
+  }
+
+  return result("passed", "External PR includes after-fix real behavior proof.", { fields });
+}
+
+export function labelsForRealBehaviorProof(evaluation) {
+  if (evaluation.status === "mock_only") {
+    return [MOCK_ONLY_PROOF_LABEL];
+  }
+  if (evaluation.status === "missing" || evaluation.status === "insufficient") {
+    return [NEEDS_REAL_BEHAVIOR_PROOF_LABEL];
+  }
+  return [];
+}
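
Note on the parsing in `real-behavior-proof-policy.mjs` above: the gate first isolates the `Real behavior proof` section (from its heading to the next heading), then reads each labeled field line. The sketch below is a simplified, standalone approximation. `parseProofSection` and `readField` are illustrative names, the field regex omits the bold-marker handling, and only single-line values are read, whereas the policy also collects continuation lines and accepts several aliases per field.

```javascript
// Simplified sketch; parseProofSection and readField are hypothetical names.
function parseProofSection(body = "") {
  const heading = /^#{2,6}\s+real behavior proof\b[^\n]*$/im.exec(body);
  if (!heading) {
    return "";
  }
  const rest = body.slice(heading.index + heading[0].length);
  const nextHeading = rest.match(/\n#{1,6}\s+\S/);
  return (nextHeading ? rest.slice(0, nextHeading.index) : rest).trim();
}

function readField(section, name) {
  const escaped = name.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Optional list bullet, the field name, an optional "(...)" hint, then ": value".
  const pattern = new RegExp(
    `^\\s*(?:[-*]\\s*)?${escaped}(?:\\s*\\([^)]*\\))?\\s*:\\s*(.*)$`,
    "i",
  );
  for (const line of section.split("\n")) {
    const match = line.match(pattern);
    if (match) {
      return match[1].trim();
    }
  }
  return "";
}

const body = [
  "## Summary",
  "",
  "- Fix gateway startup.",
  "",
  "## Real behavior proof (required for external PRs)",
  "",
  "- Behavior or issue addressed: Gateway status reports Discord as ready.",
  "- Evidence after fix (screenshot, recording): Redacted runtime log: discord ready",
  "- Observed result after fix:",
  "",
  "## Root Cause (if applicable)",
  "",
  "N/A",
].join("\n");

const section = parseProofSection(body);
console.log(readField(section, "Evidence after fix"));
console.log(JSON.stringify(readField(section, "Observed result after fix")));
```

An empty value such as the `Observed result after fix` line here is what the policy reports as missing field content.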
--- a/test/scripts/barnacle-auto-response.test.ts
+++ b/test/scripts/barnacle-auto-response.test.ts
@@ -37,6 +37,28 @@ function pr(title: string, body = blankTemplateBody) {
  };
 }

+function realBehaviorProofBody(evidence: string, overrides: Record<string, string> = {}) {
+  const fields = {
+    behavior: "Gateway status now reports the Discord channel as ready.",
+    environment: "macOS 15.4, Node 24, local OpenClaw gateway, redacted Discord token.",
+    steps: "pnpm openclaw gateway restart and pnpm openclaw gateway status",
+    evidence,
+    observedResult: "The gateway stayed connected and Discord reported ready.",
+    notTested: "No known gaps.",
+    ...overrides,
+  };
+  return [
+    "## Real behavior proof",
+    "",
+    `- Behavior or issue addressed: ${fields.behavior}`,
+    `- Real environment tested: ${fields.environment}`,
+    `- Exact steps or command run after this patch: ${fields.steps}`,
+    `- Evidence after fix: ${fields.evidence}`,
+    `- Observed result after fix: ${fields.observedResult}`,
+    `- What was not tested: ${fields.notTested}`,
+  ].join("\n");
+}
+
 function file(filename: string, status = "modified") {
  return {
    filename,
@@ -236,6 +258,44 @@ describe("barnacle-auto-response", () => {
    );
  });

+  it("labels external PRs that are missing real behavior proof", () => {
+    const labels = classifyPullRequestCandidateLabels(pr("Fix gateway startup"), [
+      file("src/gateway/server.ts"),
+    ]);
+
+    expect(labels).toContain(candidateLabels.needsRealBehaviorProof);
+    expect(labels).not.toContain(candidateLabels.mockOnlyProof);
+  });
+
+  it("labels external PRs whose proof is only tests or mocks", () => {
+    const labels = classifyPullRequestCandidateLabels(
+      pr(
+        "Fix gateway startup",
+        realBehaviorProofBody("pnpm test passed with Vitest mocks.", {
+          steps: "pnpm test",
+          observedResult: "CI passes.",
+        }),
+      ),
+      [file("src/gateway/server.ts")],
+    );
+
+    expect(labels).toContain(candidateLabels.mockOnlyProof);
+    expect(labels).not.toContain(candidateLabels.needsRealBehaviorProof);
+  });
+
+  it("does not label external PRs that include real behavior proof", () => {
+    const labels = classifyPullRequestCandidateLabels(
+      pr(
+        "Fix gateway startup",
+        realBehaviorProofBody("![after](https://github.com/user-attachments/assets/gateway-ready)"),
+      ),
+      [file("src/gateway/server.ts")],
+    );
+
+    expect(labels).not.toContain(candidateLabels.needsRealBehaviorProof);
+    expect(labels).not.toContain(candidateLabels.mockOnlyProof);
+  });
+
  it("uses linked issues as context and suppresses low-signal docs labels", () => {
    const labels = classifyPullRequestCandidateLabels(
      pr("Update docs", `${blankTemplateBody}\n\nRelated #12345`),
@@ -577,6 +637,43 @@ describe("barnacle-auto-response", () => {
    expect(calls.update).toEqual([]);
  });

+  it("adds proof labels to external PRs without auto-closing by default", async () => {
+    const { calls, github } = barnacleGithub([file("src/gateway/server.ts")]);
+
+    await runBarnacleAutoResponse({
+      github,
+      context: barnacleContext({}),
+      core: {
+        info: () => undefined,
+      },
+    });
+
+    expect(calls.addLabels).toContainEqual(
+      expect.objectContaining({
+        labels: expect.arrayContaining([candidateLabels.needsRealBehaviorProof]),
+      }),
+    );
+    expect(calls.createComment).toEqual([]);
+    expect(calls.update).toEqual([]);
+  });
+
+  it("removes stale proof labels when override is present", async () => {
+    const { calls, github } = barnacleGithub([file("src/gateway/server.ts")]);
+
+    await runBarnacleAutoResponse({
+      github,
+      context: barnacleContext({}, [candidateLabels.needsRealBehaviorProof, "proof: override"]),
+      core: {
+        info: () => undefined,
+      },
+    });
+
+    expect(calls.removeLabel).toContainEqual(
+      expect.objectContaining({ name: candidateLabels.needsRealBehaviorProof }),
+    );
+    expect(calls.update).toEqual([]);
+  });
+
  it("actions manually applied candidate labels", async () => {
    const { calls, github } = barnacleGithub([file("extensions/example/openclaw.plugin.json")]);

@@ -637,7 +734,7 @@ describe("barnacle-auto-response", () => {
    expect(calls.removeLabel).toContainEqual(expect.objectContaining({ name: "trigger-response" }));
    expect(calls.createComment).toContainEqual(
      expect.objectContaining({
-        body: expect.stringContaining("only changes tests"),
+        body: expect.stringContaining("does not include real behavior proof"),
      }),
    );
    expect(calls.update).toContainEqual(expect.objectContaining({ state: "closed" }));
--- a/test/scripts/real-behavior-proof-policy.test.ts
+++ b/test/scripts/real-behavior-proof-policy.test.ts
@@ -0,0 +1,153 @@
+import { describe, expect, it } from "vitest";
+import {
+  MOCK_ONLY_PROOF_LABEL,
+  NEEDS_REAL_BEHAVIOR_PROOF_LABEL,
+  PROOF_OVERRIDE_LABEL,
+  evaluateRealBehaviorProof,
+  labelsForRealBehaviorProof,
+} from "../../scripts/github/real-behavior-proof-policy.mjs";
+
+function externalPr(body: string, overrides: Record<string, unknown> = {}) {
+  return {
+    body,
+    author_association: "CONTRIBUTOR",
+    user: {
+      login: "external-contributor",
+      type: "User",
+    },
+    labels: [],
+    ...overrides,
+  };
+}
+
+function proofBody(evidence: string, overrides: Record<string, string> = {}) {
+  const fields = {
+    behavior: "Gateway startup no longer drops the configured Discord channel.",
+    environment: "macOS 15.4, Node 24, local OpenClaw gateway with a redacted Discord token.",
+    steps: "pnpm openclaw gateway restart, then pnpm openclaw gateway status",
+    evidence,
+    observedResult: "The gateway stayed connected and the Discord channel showed ready.",
+    notTested: "No known gaps.",
+    ...overrides,
+  };
+  return [
+    "## Real behavior proof",
+    "",
+    `- Behavior or issue addressed: ${fields.behavior}`,
+    `- Real environment tested: ${fields.environment}`,
+    `- Exact steps or command run after this patch: ${fields.steps}`,
+    `- Evidence after fix: ${fields.evidence}`,
+    `- Observed result after fix: ${fields.observedResult}`,
+    `- What was not tested: ${fields.notTested}`,
+  ].join("\n");
+}
+
+describe("real-behavior-proof-policy", () => {
+  it.each([
+    "![after](https://github.com/user-attachments/assets/abc123)",
+    "Linked artifact: https://github.com/openclaw/openclaw/actions/runs/123456789/artifacts/987654321",
+    "Redacted runtime log: gateway connected Discord channel and delivered the reply.",
+    ["Terminal transcript:", "```text", "$ openclaw gateway status", "discord ready", "```"].join(
+      "\n",
+    ),
+  ])("passes external PRs with real after-fix evidence: %s", (evidence) => {
+    const evaluation = evaluateRealBehaviorProof({
+      pullRequest: externalPr(proofBody(evidence)),
+    });
+
+    expect(evaluation.status).toBe("passed");
+    expect(labelsForRealBehaviorProof(evaluation)).toEqual([]);
+  });
+
+  it("fails external PRs without a real behavior proof section", () => {
+    const evaluation = evaluateRealBehaviorProof({
+      pullRequest: externalPr("## Summary\n\n- Fixed startup."),
+    });
+
+    expect(evaluation.status).toBe("missing");
+    expect(labelsForRealBehaviorProof(evaluation)).toEqual([NEEDS_REAL_BEHAVIOR_PROOF_LABEL]);
+  });
+
+  it("fails external PRs that say the changed behavior was not tested", () => {
+    const evaluation = evaluateRealBehaviorProof({
+      pullRequest: externalPr(proofBody("not tested")),
+    });
+
+    expect(evaluation.status).toBe("missing");
+    expect(labelsForRealBehaviorProof(evaluation)).toEqual([NEEDS_REAL_BEHAVIOR_PROOF_LABEL]);
+  });
+
+  it("fails external PRs whose proof is only tests, mocks, snapshots, lint, typecheck, or CI", () => {
+    const evaluation = evaluateRealBehaviorProof({
+      pullRequest: externalPr(
+        proofBody("pnpm test passed and Vitest mocks cover the branch.", {
+          steps: "pnpm test",
+          observedResult: "CI passes.",
+        }),
+      ),
+    });
+
+    expect(evaluation.status).toBe("mock_only");
+    expect(labelsForRealBehaviorProof(evaluation)).toEqual([MOCK_ONLY_PROOF_LABEL]);
+  });
+
+  it("fails external PRs whose only copied output is a fenced test or CI transcript", () => {
+    const evaluation = evaluateRealBehaviorProof({
+      pullRequest: externalPr(
+        proofBody(["```text", "$ pnpm test", "CI passed with Vitest mocks", "```"].join("\n"), {
+          steps: "pnpm test",
+          observedResult: "CI passes.",
+        }),
+      ),
+    });
+
+    expect(evaluation.status).toBe("mock_only");
+    expect(labelsForRealBehaviorProof(evaluation)).toEqual([MOCK_ONLY_PROOF_LABEL]);
+  });
+
+  it("fails external PRs whose terminal label only contains test or CI output", () => {
+    const evaluation = evaluateRealBehaviorProof({
+      pullRequest: externalPr(
+        proofBody(
+          [
+            "Terminal transcript:",
+            "```text",
+            "$ pnpm test",
+            "CI passed with Vitest mocks",
+            "```",
+          ].join("\n"),
+          {
+            steps: "pnpm test",
+            observedResult: "CI passes.",
+          },
+        ),
+      ),
+    });
+
+    expect(evaluation.status).toBe("mock_only");
+    expect(labelsForRealBehaviorProof(evaluation)).toEqual([MOCK_ONLY_PROOF_LABEL]);
+  });
+
+  it("passes maintainer, bot, and override cases", () => {
+    expect(
+      evaluateRealBehaviorProof({
+        pullRequest: externalPr("", { author_association: "MEMBER" }),
+      }).status,
+    ).toBe("skipped");
+    expect(
+      evaluateRealBehaviorProof({
+        pullRequest: externalPr("", {
+          user: {
+            login: "renovate[bot]",
+            type: "Bot",
+          },
+        }),
+      }).status,
+    ).toBe("skipped");
+    expect(
+      evaluateRealBehaviorProof({
+        pullRequest: externalPr("", { labels: [{ name: PROOF_OVERRIDE_LABEL }] }),
+      }).status,
+    ).toBe("override");
+  });
+});