Eva
108e5c89de
qa-lab: scope parity metrics and harden fake-success detector
- scope computeQaAgenticParityMetrics to QA_AGENTIC_PARITY_SCENARIO_TITLES
in buildQaAgenticParityComparison so extra non-parity lanes in a full
qa-suite-summary.json cannot influence completion / unintended-stop /
valid-tool / fake-success rates
- filter coverageMismatch by !parityTitleSet.has(name) so each required
parity scenario fails the gate exactly once (from requiredScenarioCoverage)
instead of being double-reported as a coverage mismatch too
- drop the bare /\berror\b/i rule from SUSPICIOUS_PASS_PATTERNS, which was
false-flagging legitimate passes that narrate "Error budget: 0" or
"no errors found", and replace it with targeted /error occurred/i and
/an error was/i phrases that indicate a real mid-turn error
- add regressions: error-budget/no-errors-observed passes yield
fakeSuccessCount === 0, genuine error-occurred narration still flags,
each missing required scenario fires exactly one failure line, and
non-parity lanes do not perturb scoped metrics
- isolate the baseline suspicious-pass test by padding it to the full
first-wave scenario set so it asserts the isolated fake-success path
via toEqual([...]) rather than toContain
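
The scoping and detector changes described above can be sketched roughly as follows. This is an illustrative sketch only, not the code from the commit: `SUSPICIOUS_PASS_PATTERNS` and `parityTitleSet` are names taken from the message, while `ScenarioResult`, `countFakeSuccesses`, and `filterCoverageMismatch` are hypothetical helpers invented for this example.

```typescript
// Hypothetical shape of one scenario row in a QA summary (an assumption, not the real type).
type ScenarioResult = {
  name: string;
  status: "pass" | "fail" | "skip";
  details?: string;
};

// Targeted phrases only; a bare /\berror\b/i would also match "Error budget: 0".
const SUSPICIOUS_PASS_PATTERNS: RegExp[] = [/error occurred/i, /an error was/i];

// Count passes whose narration suggests a real error, restricted to parity scenarios
// so that extra non-parity lanes cannot influence the rate.
function countFakeSuccesses(
  results: ScenarioResult[],
  parityTitleSet: ReadonlySet<string>,
): number {
  return results.filter(
    (r) =>
      parityTitleSet.has(r.name) &&
      r.status === "pass" &&
      SUSPICIOUS_PASS_PATTERNS.some((p) => p.test(r.details ?? "")),
  ).length;
}

// Keep only mismatches for non-parity names, so a missing required parity
// scenario is reported once (by the required-coverage check) rather than twice.
function filterCoverageMismatch(
  mismatchedNames: string[],
  parityTitleSet: ReadonlySet<string>,
): string[] {
  return mismatchedNames.filter((name) => !parityTitleSet.has(name));
}
```

Under these assumptions, a parity pass that narrates "Error budget: 0; no errors found" is not counted, while one that narrates "An error occurred while calling the tool" is.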
2026-04-11 14:22:48 +01:00
Eva
95f8ad215f
Treat skipped parity scenarios as uncovered
2026-04-11 14:22:48 +01:00
Eva
17252df122
Tighten parity proof heuristics
2026-04-11 14:22:48 +01:00
Eva
fd45ea2bf1
test(qa): add compaction retry parity scenario
2026-04-11 14:22:48 +01:00
Eva
3211aa2540
fix(qa): surface missing required scenarios in parity report
2026-04-11 14:22:48 +01:00
Eva
55df6f11a4
fix: harden parity gate review findings
2026-04-11 14:22:48 +01:00
Eva
db09edacfc
qa-lab: gate parity on shared scenario coverage
2026-04-11 14:22:48 +01:00
Eva
67fdd3b4df
benchmarks: add agentic parity report gate
2026-04-11 14:22:48 +01:00
Eva
79f539d9ce
docs: clarify GPT-5.4 parity harness and review flow
2026-04-11 14:22:48 +01:00
Eva
d9c7ddb099
test: add agentic parity scenario pack
2026-04-11 14:22:48 +01:00
Vincent Koc
1167093773
test(qa): drop rebase conflict marker
2026-04-11 13:24:45 +01:00
Vincent Koc
d21573d3a1
fix(qa): catch leaked harness meta replies
2026-04-11 13:23:26 +01:00
Peter Steinberger
d72fb7efb9
fix: harden QA scenario matcher validation
2026-04-11 13:19:13 +01:00
Peter Steinberger
cd89892b1f
fix(release): keep private QA bundles out of npm pack
2026-04-11 13:13:11 +01:00
Ayaan Zaidi
478a2e15c5
fix: narrow qa cli facade startup path
2026-04-11 10:41:19 +05:30
Peter Steinberger
1ab6e5dbf0
chore(release): bump version to 2026.4.11
2026-04-11 04:51:17 +01:00
Ayaan Zaidi
959b1472dc
test(qa-lab): include telegram mentioned-message scenario
2026-04-11 08:48:42 +05:30
Ayaan Zaidi
b0b0fb308d
feat(qa-lab): add telegram mentioned-message scenario
2026-04-11 08:48:42 +05:30
Ayaan Zaidi
a0b5c7b0c4
test(qa-lab): cover telegram command demo scenarios
2026-04-11 08:48:42 +05:30
Ayaan Zaidi
7c14d8b0f4
feat(qa-lab): add telegram command demo scenarios
2026-04-11 08:48:42 +05:30
Ayaan Zaidi
f9a03f0f4b
test(qa-lab): cover telegram mention-gating
2026-04-11 08:48:42 +05:30
Ayaan Zaidi
355690a72c
feat(qa-lab): add telegram mention-gating scenario
2026-04-11 08:48:42 +05:30
Vincent Koc
350299401f
fix(cycles): continue shared seam extraction
2026-04-11 02:46:41 +01:00
Peter Steinberger
39d1a817fa
lint: enable small oxlint rules
2026-04-11 02:15:21 +01:00
Peter Steinberger
55578a5c40
fix: stabilize Codex runtime truthfulness (#64439) (thanks @100yenadmin)
2026-04-11 01:19:32 +01:00
Gustavo Madeira Santana
00837f05bf
qa-lab: drain Matrix sync batch before returning match
2026-04-10 20:17:30 -04:00
Peter Steinberger
11b0016e9e
refactor: simplify provider channel conversions
2026-04-11 01:08:23 +01:00
Peter Steinberger
85ee6f2967
fix: stabilize live qa suite routing
2026-04-11 00:58:40 +01:00
Gustavo Madeira Santana
25445a9f2e
qa-lab: add Matrix live transport QA lane (#64489)
Merged via squash.
Prepared head SHA: ae9bb37751
Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com>
Reviewed-by: @gumadeiras
2026-04-10 19:35:08 -04:00
Peter Steinberger
22c2af0065
test: isolate qa network fetches
2026-04-10 23:46:20 +01:00
Peter Steinberger
610407730d
fix: stop qa lab children cleanly
2026-04-10 23:29:58 +01:00
Peter Steinberger
d236cb4680
chore: enable redundant type constituent checks
2026-04-10 21:23:40 +01:00
Peter Steinberger
0ebeee8b0d
chore: enable consistent-return
2026-04-10 20:56:43 +01:00
Peter Steinberger
925a499d84
ci: fix additional guard failures
2026-04-10 19:23:10 +01:00
Peter Steinberger
777c6f7580
refactor: split manifest command alias helpers
2026-04-10 17:37:31 +01:00
Ayaan Zaidi
8755d2d3da
fix: bound telegram qa api requests
2026-04-10 22:06:38 +05:30
Ayaan Zaidi
1512f9188d
fix: reject unknown telegram qa scenarios
2026-04-10 22:06:38 +05:30
Peter Steinberger
d5df4cd4e5
test: add Anthropic Opus QA smokes
2026-04-10 17:24:54 +01:00
Ayaan Zaidi
9d3583bc2f
fix(qa-lab): tighten telegram canary matching
2026-04-10 21:53:31 +05:30
Ayaan Zaidi
ecb3e0a62d
fix(qa-lab): harden telegram qa artifacts
2026-04-10 21:53:31 +05:30
Ayaan Zaidi
d69cc5da5c
fix(qa-lab): address remaining review comments
2026-04-10 21:53:31 +05:30
Ayaan Zaidi
2aaf5a3baa
fix(qa-lab): address telegram qa review comments
2026-04-10 21:53:31 +05:30
Ayaan Zaidi
7348c3193d
test(telegram): cover threaded qa replies
2026-04-10 21:53:31 +05:30
Ayaan Zaidi
88a7970f84
fix(telegram): thread native command replies
2026-04-10 21:53:31 +05:30
Ayaan Zaidi
0ff03a74a8
fix(qa-lab): trust telegram canary send result
2026-04-10 21:53:31 +05:30
Ayaan Zaidi
653a110ef6
fix(qa-lab): refine telegram canary output
2026-04-10 21:53:31 +05:30
Ayaan Zaidi
5c7a232ebc
fix(qa-lab): improve telegram canary diagnostics
2026-04-10 21:53:31 +05:30
Ayaan Zaidi
e093cb6c93
feat(qa-lab): add telegram live qa lane
2026-04-10 21:53:31 +05:30
Peter Steinberger
07e7222e28
test: split Claude CLI QA auth modes
2026-04-10 14:56:36 +01:00
Peter Steinberger
ddfd6c3401
fix: guard QA lab gateway health fetch (#64242)
2026-04-10 14:56:12 +01:00