Commit Graph

173 Commits

Author SHA1 Message Date
Eva
108e5c89de qa-lab: scope parity metrics and harden fake-success detector
- scope computeQaAgenticParityMetrics to QA_AGENTIC_PARITY_SCENARIO_TITLES
  in buildQaAgenticParityComparison so extra non-parity lanes in a full
  qa-suite-summary.json cannot influence completion / unintended-stop /
  valid-tool / fake-success rates
- filter coverageMismatch by !parityTitleSet.has(name) so each required
  parity scenario fails the gate exactly once (from requiredScenarioCoverage)
  instead of being double-reported as a coverage mismatch too
- drop the bare /\\berror\\b/i rule from SUSPICIOUS_PASS_PATTERNS — it was
  false-flagging legitimate passes that narrate "Error budget: 0" or
  "no errors found" — and replace it with targeted /error occurred/i and
  /an error was/i phrases that indicate a real mid-turn error
- add regressions: error-budget/no-errors-observed passes yield
  fakeSuccessCount === 0, genuine error-occurred narration still flags,
  each missing required scenario fires exactly one failure line, and
  non-parity lanes do not perturb scoped metrics
- isolate the baseline suspicious-pass test by padding it to the full
  first-wave scenario set so it asserts the isolated fake-success path
  via toEqual([...]) rather than toContain
2026-04-11 14:22:48 +01:00
Eva
95f8ad215f Treat skipped parity scenarios as uncovered 2026-04-11 14:22:48 +01:00
Eva
17252df122 Tighten parity proof heuristics 2026-04-11 14:22:48 +01:00
Eva
fd45ea2bf1 test(qa): add compaction retry parity scenario 2026-04-11 14:22:48 +01:00
Eva
3211aa2540 fix(qa): surface missing required scenarios in parity report 2026-04-11 14:22:48 +01:00
Eva
55df6f11a4 fix: harden parity gate review findings 2026-04-11 14:22:48 +01:00
Eva
db09edacfc qa-lab: gate parity on shared scenario coverage 2026-04-11 14:22:48 +01:00
Eva
67fdd3b4df benchmarks: add agentic parity report gate 2026-04-11 14:22:48 +01:00
Eva
79f539d9ce docs: clarify GPT-5.4 parity harness and review flow 2026-04-11 14:22:48 +01:00
Eva
d9c7ddb099 test: add agentic parity scenario pack 2026-04-11 14:22:48 +01:00
Vincent Koc
1167093773 test(qa): drop rebase conflict marker 2026-04-11 13:24:45 +01:00
Vincent Koc
d21573d3a1 fix(qa): catch leaked harness meta replies 2026-04-11 13:23:26 +01:00
Peter Steinberger
d72fb7efb9 fix: harden QA scenario matcher validation 2026-04-11 13:19:13 +01:00
Peter Steinberger
cd89892b1f fix(release): keep private QA bundles out of npm pack 2026-04-11 13:13:11 +01:00
Ayaan Zaidi
478a2e15c5 fix: narrow qa cli facade startup path 2026-04-11 10:41:19 +05:30
Peter Steinberger
1ab6e5dbf0 chore(release): bump version to 2026.4.11 2026-04-11 04:51:17 +01:00
Ayaan Zaidi
959b1472dc test(qa-lab): include telegram mentioned-message scenario 2026-04-11 08:48:42 +05:30
Ayaan Zaidi
b0b0fb308d feat(qa-lab): add telegram mentioned-message scenario 2026-04-11 08:48:42 +05:30
Ayaan Zaidi
a0b5c7b0c4 test(qa-lab): cover telegram command demo scenarios 2026-04-11 08:48:42 +05:30
Ayaan Zaidi
7c14d8b0f4 feat(qa-lab): add telegram command demo scenarios 2026-04-11 08:48:42 +05:30
Ayaan Zaidi
f9a03f0f4b test(qa-lab): cover telegram mention-gating 2026-04-11 08:48:42 +05:30
Ayaan Zaidi
355690a72c feat(qa-lab): add telegram mention-gating scenario 2026-04-11 08:48:42 +05:30
Vincent Koc
350299401f fix(cycles): continue shared seam extraction 2026-04-11 02:46:41 +01:00
Peter Steinberger
39d1a817fa lint: enable small oxlint rules 2026-04-11 02:15:21 +01:00
Peter Steinberger
55578a5c40 fix: stabilize Codex runtime truthfulness (#64439) (thanks @100yenadmin) 2026-04-11 01:19:32 +01:00
Gustavo Madeira Santana
00837f05bf qa-lab: drain Matrix sync batch before returning match 2026-04-10 20:17:30 -04:00
Peter Steinberger
11b0016e9e refactor: simplify provider channel conversions 2026-04-11 01:08:23 +01:00
Peter Steinberger
85ee6f2967 fix: stabilize live qa suite routing 2026-04-11 00:58:40 +01:00
Gustavo Madeira Santana
25445a9f2e qa-lab: add Matrix live transport QA lane (#64489)
Merged via squash.

Prepared head SHA: ae9bb37751
Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com>
Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com>
Reviewed-by: @gumadeiras
2026-04-10 19:35:08 -04:00
Peter Steinberger
22c2af0065 test: isolate qa network fetches 2026-04-10 23:46:20 +01:00
Peter Steinberger
610407730d fix: stop qa lab children cleanly 2026-04-10 23:29:58 +01:00
Peter Steinberger
d236cb4680 chore: enable redundant type constituent checks 2026-04-10 21:23:40 +01:00
Peter Steinberger
0ebeee8b0d chore: enable consistent-return 2026-04-10 20:56:43 +01:00
Peter Steinberger
925a499d84 ci: fix additional guard failures 2026-04-10 19:23:10 +01:00
Peter Steinberger
777c6f7580 refactor: split manifest command alias helpers 2026-04-10 17:37:31 +01:00
Ayaan Zaidi
8755d2d3da fix: bound telegram qa api requests 2026-04-10 22:06:38 +05:30
Ayaan Zaidi
1512f9188d fix: reject unknown telegram qa scenarios 2026-04-10 22:06:38 +05:30
Peter Steinberger
d5df4cd4e5 test: add Anthropic Opus QA smokes 2026-04-10 17:24:54 +01:00
Ayaan Zaidi
9d3583bc2f fix(qa-lab): tighten telegram canary matching 2026-04-10 21:53:31 +05:30
Ayaan Zaidi
ecb3e0a62d fix(qa-lab): harden telegram qa artifacts 2026-04-10 21:53:31 +05:30
Ayaan Zaidi
d69cc5da5c fix(qa-lab): address remaining review comments 2026-04-10 21:53:31 +05:30
Ayaan Zaidi
2aaf5a3baa fix(qa-lab): address telegram qa review comments 2026-04-10 21:53:31 +05:30
Ayaan Zaidi
7348c3193d test(telegram): cover threaded qa replies 2026-04-10 21:53:31 +05:30
Ayaan Zaidi
88a7970f84 fix(telegram): thread native command replies 2026-04-10 21:53:31 +05:30
Ayaan Zaidi
0ff03a74a8 fix(qa-lab): trust telegram canary send result 2026-04-10 21:53:31 +05:30
Ayaan Zaidi
653a110ef6 fix(qa-lab): refine telegram canary output 2026-04-10 21:53:31 +05:30
Ayaan Zaidi
5c7a232ebc fix(qa-lab): improve telegram canary diagnostics 2026-04-10 21:53:31 +05:30
Ayaan Zaidi
e093cb6c93 feat(qa-lab): add telegram live qa lane 2026-04-10 21:53:31 +05:30
Peter Steinberger
07e7222e28 test: split Claude CLI QA auth modes 2026-04-10 14:56:36 +01:00
Peter Steinberger
ddfd6c3401 fix: guard QA lab gateway health fetch (#64242) 2026-04-10 14:56:12 +01:00