Commit Graph

200 Commits

Author SHA1 Message Date
Vincent Koc
e9720c27fa fix(qa): accept Codex capped read evidence (#96366) 2026-06-24 18:07:13 +08:00
Vincent Koc
8242923fe3 fix(qa): allow async runtime fixture starts 2026-06-24 17:52:16 +08:00
Dallin Romney
9666db607e test(qa): clean up smoke taxonomy profile (#96320) 2026-06-24 00:43:00 -07:00
Vincent Koc
0e71ae5df4 fix(qa): enforce fanout completion drain 2026-06-24 08:29:37 +08:00
Vincent Koc
e457c4c324 fix(qa): drain fanout child completions 2026-06-24 08:29:37 +08:00
Vincent Koc
ab9d3ad6d7 fix(qa): settle channel no-reply check 2026-06-24 08:29:37 +08:00
Vincent Koc
960b9fa4f3 fix(qa): scope no-outbound waits 2026-06-24 08:29:37 +08:00
Vincent Koc
6f80552ee9 fix(qa): prove direct reply routing via qa channel 2026-06-24 00:41:28 +08:00
Dallin Romney
f6b2a5ffb4 test(qa): harden all-profile evidence scenarios (#96003) 2026-06-23 00:07:51 -07:00
Vincent Koc
f24b1a9c0c test(qa): relax heartbeat target none startup probe 2026-06-23 08:18:55 +02:00
Vincent Koc
53f9b6a36b test(qa): align release memory scenario assertions 2026-06-23 07:43:06 +02:00
Dallin Romney
438f208a76 perf(qa-lab): speed up unified QA suites (#95944)
* perf(qa-lab): speed up smoke ci suite

* fix(qa-lab): satisfy suite scheduler lint

* fix(qa-lab): settle unified partitions before retry

* fix(qa-lab): preserve isolated suite safeguards

* refactor(qa-lab): make suite isolation explicit

* fix(qa-lab): preserve channel-driver suite serialization

* fix(qa-lab): narrow flow-only isolation metadata
2026-06-22 21:55:54 -07:00
Vincent Koc
495a4f9b8e test(qa): accept verified live fanout completions 2026-06-23 06:46:40 +02:00
Dallin Romney
d3dc7aaa87 docs: update maturity scorecard (#95933)
* docs: update maturity scorecard

* docs: rerender maturity scorecard from all evidence
2026-06-22 21:37:03 -07:00
Vincent Koc
68a1e00b73 fix(agents): retry silent subagent completion handoffs 2026-06-23 06:04:16 +02:00
Vincent Koc
54b2243de3 test(qa): tighten release profile scenario waits 2026-06-23 06:04:16 +02:00
Vincent Koc
a9024741c2 test(qa): pin live artifact scenario contracts 2026-06-23 05:13:35 +02:00
Vincent Koc
d1b268f7f7 fix(qa): normalize completed wait envelopes 2026-06-23 05:13:35 +02:00
Vincent Koc
c48dd3cdd1 fix(ci): align maturity score source with taxonomy 2026-06-23 10:46:07 +08:00
Dallin Romney
b71ddbf1b4 ci: simplify maturity scorecard QA evidence inputs (#95898)
* ci: simplify maturity scorecard evidence inputs

* ci: keep maturity renderer defaults runnable

* ci: validate maturity evidence source

* ci: split maturity scorecard codex agent

* ci: remove codex copy from maturity evidence workflow

* ci: narrow maturity evidence workflow secrets
2026-06-22 19:24:43 -07:00
Vincent Koc
d716dfd532 test(qa): wait for live history replies in flow scenarios 2026-06-23 04:01:11 +02:00
Vincent Koc
def4b51485 fix(qa): gate smoke profile scenarios by channel driver 2026-06-23 09:34:52 +08:00
Vincent Koc
43f2b61f3b test(qa): keep image generation fixture on mock lane 2026-06-23 02:35:02 +02:00
Vincent Koc
086c629556 test(qa): scope provider-sensitive flow fixtures 2026-06-23 02:17:20 +02:00
Vincent Koc
befe04f465 test(qa): accept Sonnet max thinking support 2026-06-23 01:57:43 +02:00
Vincent Koc
264b37e9d2 test(qa): avoid redacted config cleanup patch 2026-06-23 01:39:39 +02:00
Dallin Romney
63b13ea837 feat(qa): crabline channel driver (#91502)
* feat(qa): add crabline channel driver seam

* feat: run crabline channel driver smoke

* chore: keep crabline qa dependency dev-only

* refactor(qa): keep crabline driver details opaque

* chore(qa): pin crabline to merged driver API

* feat(qa): drive channel driver from profiles

* fix(qa): declare crabline runtime peer

* feat(qa): resolve crabline channel from scenarios

* feat(qa): treat unsupported profile channels as coverage gaps

* Revert "feat(qa): treat unsupported profile channels as coverage gaps"

This reverts commit 65a9701655.

* fix(qa): adapt crabline driver to chat sdk cli

* refactor(qa): pass channel driver metadata directly

* chore(qa): update crabline provider pin

* chore(qa): default channel scenarios to driver

* chore: repair qa dependency lockfile

* chore: allow native qa dependency builds

* fix(qa): satisfy crabline driver lint

* fix(qa): satisfy crabline ci gates

* Use crabline transport for smoke QA profile

* fix(qa): keep crabline driver opt-in

* fix(qa): reuse crabline telegram driver token

* fix(qa): route smoke profile through crabline

* fix(qa): run full smoke profile lane

* fix(qa): remove smoke scenario workflow filter

* fix: stabilize crabline smoke qa profile

* fix: pin crabline qa dependency

* test: keep crabline smoke credential-free

* fix: skip visible reasoning lane for crabline smoke

* fix: unblock crabline qa ci

* Update crabline dependency

* Pin crabline to merged main

* Use Crabline fake provider servers
2026-06-22 15:24:59 -07:00
Vincent Koc
7cc21ef59d fix(qa): stabilize smoke-ci scenarios 2026-06-22 15:41:53 +08:00
Vincent Koc
7d4d8a7f3d test(agents): add large prompt cache coverage (#95653)
Adds large OpenAI and Anthropic prompt-cache live coverage plus a QA Lab long-context tool-result scenario.

Co-authored-by: vincentkoc <vincentkoc@users.noreply.github.com>
Reviewed-by: @vincentkoc
2026-06-22 11:39:37 +08:00
Dallin Romney
5dd30c3995 test: fold HTTP API script proof into QA Lab (#94700)
* test: fold HTTP API script proof into QA Lab

* test: remove folded HTTP API script tests

* test: relax QA native scenario catalog inventory

* test: trim folded QA Lab script cruft

* test: align folded QA coverage ids

* test: keep native QA evidence out of parity tiers

* test: update mirrored QA routing expectation

* test: preserve chat tools profile build guard

* test: avoid overclaiming gateway tool API coverage

* test: pin folded QA coverage ids
2026-06-20 23:10:35 -07:00
Jesse Merhi
5db2f6c1fc Add stdout diagnostics OTEL log exporter
Adds stdout and both-mode diagnostics OTEL log export, with focused QA Lab smoke coverage and docs/config updates.

Prepared head SHA: efa2ef07ab
Verification: CI 27808480969 passed for the prepared head.
Reviewed-by: @jesse-merhi
2026-06-19 16:06:37 +10:00
Dallin Romney
e12cf72b17 Standardize QA coverage IDs on dotted names (#94702)
* fix: standardize qa coverage ids

* test: avoid qa coverage id assertion spread
2026-06-18 17:25:26 -07:00
Dallin Romney
4ca0e52d0e test: fold channel message flows into qa e2e (#93174)
* test: fold channel flows into qa e2e

* test: keep channel flow skill pointed at qa

* test: move channel flow proof under telegram
2026-06-18 15:45:33 -07:00
Dallin Romney
c4ae2be947 fix: taxonomy coverage id cleanup (#94304)
* fix: split taxonomy coverage id features

* fix: clean taxonomy feature row names

* docs: clarify taxonomy coverage id semantics

* docs: tighten coverage id guidance

* fix: keep taxonomy features product shaped

* fix: narrow sdk artifact coverage bundle

* fix: name taxonomy coverage ids clearly

* fix: polish taxonomy feature descriptions
2026-06-18 15:16:58 -07:00
Colin Johnson
d5a27b0b96 test: add QA Lab UX Matrix evidence scenario (#94306)
* test: add qa lab ux matrix script scenario

* fix(qa-lab): annotate UX Matrix producer catch callback as unknown for oxlint

---------

Co-authored-by: Dallin Romney <dallinromney@gmail.com>
2026-06-18 14:10:06 -07:00
Dallin Romney
e17d111990 fix: require all taxonomy coverage ids (#94296) 2026-06-17 16:38:14 -07:00
Colin Johnson
591313e80a qa-lab: support script-backed evidence scenarios (#94276)
* qa: add script scenario execution kind

* fix(qa-lab): carry suite profile into script producer evidence and simplify artifact path resolution

* fix(qa-lab): keep out-of-repo producer artifacts absolute to avoid ../ traversal refs

---------

Co-authored-by: Dallin Romney <dallinromney@gmail.com>
2026-06-17 15:09:25 -07:00
Vincent Koc
8288b4d4c9 fix(qa-lab): stabilize web search parity fixture 2026-06-18 00:07:36 +02:00
Dallin Romney
f7e5132ffd test: fold gateway smoke into qa e2e (#93178) 2026-06-17 14:55:28 -07:00
Dallin Romney
fae4a01d0d test: fold otel smoke into qa e2e (#93181)
* test: fold otel smoke into qa e2e

* test: eliminate otel smoke script
2026-06-17 14:54:58 -07:00
Dallin Romney
0a6736af09 test: fold lifecycle and package proof into QA Lab (#93114)
* test: fold script coverage into qa scenarios

* test: migrate script checks into qa e2e

* test: point qa code refs at migrated e2e

* test: fold plugin lifecycle probe into qa e2e

* test: use shared temp dirs in plugin lifecycle probe

* test: fold plugin lifecycle sweep into qa lab

* test: trim lifecycle docker text assertions

* test: keep followup script conversions split

* test: make lifecycle docker runner script-safe

* test: update changed helper routing expectation
2026-06-17 14:22:04 -07:00
Dallin Romney
450060d7a2 test(qa): expand smoke-ci and release categories and coverage (#93175)
* test(qa): add smoke ci primary coverage evidence

* test(qa): remove overstated primary coverage claims

* test(qa): make release profile include smoke ci

* test(qa): trim taxonomy formatting churn

* test(qa): avoid hardcoded profile names in coverage test

* test(qa): make release profile cover taxonomy

* test(qa): type profile fixture all category flag

* test(qa): include channel delivery in smoke ci profile
2026-06-15 18:05:52 -07:00
Dallin Romney
fef8394079 Convert QA scenarios to YAML files (#92915)
* refactor: load QA scenarios from YAML

* docs: update personal QA scenario docs

* test: keep QA scenarios YAML-only
2026-06-14 17:31:18 -07:00
NVIDIAN
ecaebfc51b fix(agents): retry thinking-only errored turns (#92191)
Retry replay-safe reasoning-only provider errors before assistant failover while preserving classified fallback and terminal-output ownership. Adds deterministic Anthropic gateway fault-injection coverage and focused regression tests.\n\nCo-authored-by: ai-hpc <mail.speedy.hpc@hotmail.com>
2026-06-14 09:39:27 -07:00
Dallin Romney
a3e9dfee0e Simplify QA scorecard mapping shape (#92558)
* test(qa): simplify scorecard mapping shape

* test(qa): use typed scorecard evidence refs

* test(qa): map scorecard categories by coverage id

* feat: align qa coverage with taxonomy features

* refactor: keep qa coverage ids canonical

* refactor: minimize qa coverage id churn

* test: align qa coverage id assertions

* test: update qa evidence coverage expectations

* refactor qa taxonomy coverage ids

* style qa taxonomy coverage ids

* test qa coverage lint fix

* test qa coverage type fix
2026-06-14 00:16:33 -07:00
Dallin Romney
561b293c7a Run Vitest and Playwright scenarios from qa suite (#92606)
* test(qa): run vitest and playwright scenarios from qa suite

* fix(qa): harden scenario suite dispatch

* refactor(qa): share scenario path utilities

* refactor(qa): share test file scenario runner

* refactor(qa): route test file scenarios through suite runtime

* refactor(qa): use explicit suite runtime result kind

* test(qa): write suite evidence artifact

* refactor(qa): clarify suite execution dispatch

* fix(qa): keep test-file scenarios out of flow-only runners

* refactor(qa): export mixed scenario suite runner
2026-06-13 01:06:10 -07:00
Vincent Koc
7d3e8dc963 test(qa): restore memory fallback config safely 2026-06-10 18:03:15 +09:00
Vincent Koc
2c146261a2 fix(qa): accept completed mock image turns 2026-06-10 17:48:33 +09:00
Vincent Koc
87abb8defb fix(qa): preserve Matrix recovery state in sqlite 2026-06-10 17:35:41 +09:00
Vincent Koc
3b180d5d99 fix(qa): wait for restart wake before capability check 2026-06-10 17:09:18 +09:00