Commit Graph

192 Commits

Author SHA1 Message Date
Dallin Romney
f6b2a5ffb4 test(qa): harden all-profile evidence scenarios (#96003) 2026-06-23 00:07:51 -07:00
Vincent Koc
f24b1a9c0c test(qa): relax heartbeat target none startup probe 2026-06-23 08:18:55 +02:00
Vincent Koc
53f9b6a36b test(qa): align release memory scenario assertions 2026-06-23 07:43:06 +02:00
Dallin Romney
438f208a76 perf(qa-lab): speed up unified QA suites (#95944)
* perf(qa-lab): speed up smoke ci suite

* fix(qa-lab): satisfy suite scheduler lint

* fix(qa-lab): settle unified partitions before retry

* fix(qa-lab): preserve isolated suite safeguards

* refactor(qa-lab): make suite isolation explicit

* fix(qa-lab): preserve channel-driver suite serialization

* fix(qa-lab): narrow flow-only isolation metadata
2026-06-22 21:55:54 -07:00
Vincent Koc
495a4f9b8e test(qa): accept verified live fanout completions 2026-06-23 06:46:40 +02:00
Dallin Romney
d3dc7aaa87 docs: update maturity scorecard (#95933)
* docs: update maturity scorecard

* docs: rerender maturity scorecard from all evidence
2026-06-22 21:37:03 -07:00
Vincent Koc
68a1e00b73 fix(agents): retry silent subagent completion handoffs 2026-06-23 06:04:16 +02:00
Vincent Koc
54b2243de3 test(qa): tighten release profile scenario waits 2026-06-23 06:04:16 +02:00
Vincent Koc
a9024741c2 test(qa): pin live artifact scenario contracts 2026-06-23 05:13:35 +02:00
Vincent Koc
d1b268f7f7 fix(qa): normalize completed wait envelopes 2026-06-23 05:13:35 +02:00
Vincent Koc
c48dd3cdd1 fix(ci): align maturity score source with taxonomy 2026-06-23 10:46:07 +08:00
Dallin Romney
b71ddbf1b4 ci: simplify maturity scorecard QA evidence inputs (#95898)
* ci: simplify maturity scorecard evidence inputs

* ci: keep maturity renderer defaults runnable

* ci: validate maturity evidence source

* ci: split maturity scorecard codex agent

* ci: remove codex copy from maturity evidence workflow

* ci: narrow maturity evidence workflow secrets
2026-06-22 19:24:43 -07:00
Vincent Koc
d716dfd532 test(qa): wait for live history replies in flow scenarios 2026-06-23 04:01:11 +02:00
Vincent Koc
def4b51485 fix(qa): gate smoke profile scenarios by channel driver 2026-06-23 09:34:52 +08:00
Vincent Koc
43f2b61f3b test(qa): keep image generation fixture on mock lane 2026-06-23 02:35:02 +02:00
Vincent Koc
086c629556 test(qa): scope provider-sensitive flow fixtures 2026-06-23 02:17:20 +02:00
Vincent Koc
befe04f465 test(qa): accept Sonnet max thinking support 2026-06-23 01:57:43 +02:00
Vincent Koc
264b37e9d2 test(qa): avoid redacted config cleanup patch 2026-06-23 01:39:39 +02:00
Dallin Romney
63b13ea837 feat(qa): crabline channel driver (#91502)
* feat(qa): add crabline channel driver seam

* feat: run crabline channel driver smoke

* chore: keep crabline qa dependency dev-only

* refactor(qa): keep crabline driver details opaque

* chore(qa): pin crabline to merged driver API

* feat(qa): drive channel driver from profiles

* fix(qa): declare crabline runtime peer

* feat(qa): resolve crabline channel from scenarios

* feat(qa): treat unsupported profile channels as coverage gaps

* Revert "feat(qa): treat unsupported profile channels as coverage gaps"

This reverts commit 65a9701655.

* fix(qa): adapt crabline driver to chat sdk cli

* refactor(qa): pass channel driver metadata directly

* chore(qa): update crabline provider pin

* chore(qa): default channel scenarios to driver

* chore: repair qa dependency lockfile

* chore: allow native qa dependency builds

* fix(qa): satisfy crabline driver lint

* fix(qa): satisfy crabline ci gates

* Use crabline transport for smoke QA profile

* fix(qa): keep crabline driver opt-in

* fix(qa): reuse crabline telegram driver token

* fix(qa): route smoke profile through crabline

* fix(qa): run full smoke profile lane

* fix(qa): remove smoke scenario workflow filter

* fix: stabilize crabline smoke qa profile

* fix: pin crabline qa dependency

* test: keep crabline smoke credential-free

* fix: skip visible reasoning lane for crabline smoke

* fix: unblock crabline qa ci

* Update crabline dependency

* Pin crabline to merged main

* Use Crabline fake provider servers
2026-06-22 15:24:59 -07:00
Vincent Koc
7cc21ef59d fix(qa): stabilize smoke-ci scenarios 2026-06-22 15:41:53 +08:00
Vincent Koc
7d4d8a7f3d test(agents): add large prompt cache coverage (#95653)
Adds large OpenAI and Anthropic prompt-cache live coverage plus a QA Lab long-context tool-result scenario.

Co-authored-by: vincentkoc <vincentkoc@users.noreply.github.com>
Reviewed-by: @vincentkoc
2026-06-22 11:39:37 +08:00
Dallin Romney
5dd30c3995 test: fold HTTP API script proof into QA Lab (#94700)
* test: fold HTTP API script proof into QA Lab

* test: remove folded HTTP API script tests

* test: relax QA native scenario catalog inventory

* test: trim folded QA Lab script cruft

* test: align folded QA coverage ids

* test: keep native QA evidence out of parity tiers

* test: update mirrored QA routing expectation

* test: preserve chat tools profile build guard

* test: avoid overclaiming gateway tool API coverage

* test: pin folded QA coverage ids
2026-06-20 23:10:35 -07:00
Jesse Merhi
5db2f6c1fc Add stdout diagnostics OTEL log exporter
Adds stdout and both-mode diagnostics OTEL log export, with focused QA Lab smoke coverage and docs/config updates.

Prepared head SHA: efa2ef07ab
Verification: CI 27808480969 passed for the prepared head.
Reviewed-by: @jesse-merhi
2026-06-19 16:06:37 +10:00
Dallin Romney
e12cf72b17 Standardize QA coverage IDs on dotted names (#94702)
* fix: standardize qa coverage ids

* test: avoid qa coverage id assertion spread
2026-06-18 17:25:26 -07:00
Dallin Romney
4ca0e52d0e test: fold channel message flows into qa e2e (#93174)
* test: fold channel flows into qa e2e

* test: keep channel flow skill pointed at qa

* test: move channel flow proof under telegram
2026-06-18 15:45:33 -07:00
Dallin Romney
c4ae2be947 fix: taxonomy coverage id cleanup (#94304)
* fix: split taxonomy coverage id features

* fix: clean taxonomy feature row names

* docs: clarify taxonomy coverage id semantics

* docs: tighten coverage id guidance

* fix: keep taxonomy features product shaped

* fix: narrow sdk artifact coverage bundle

* fix: name taxonomy coverage ids clearly

* fix: polish taxonomy feature descriptions
2026-06-18 15:16:58 -07:00
Colin Johnson
d5a27b0b96 test: add QA Lab UX Matrix evidence scenario (#94306)
* test: add qa lab ux matrix script scenario

* fix(qa-lab): annotate UX Matrix producer catch callback as unknown for oxlint

---------

Co-authored-by: Dallin Romney <dallinromney@gmail.com>
2026-06-18 14:10:06 -07:00
Dallin Romney
e17d111990 fix: require all taxonomy coverage ids (#94296) 2026-06-17 16:38:14 -07:00
Colin Johnson
591313e80a qa-lab: support script-backed evidence scenarios (#94276)
* qa: add script scenario execution kind

* fix(qa-lab): carry suite profile into script producer evidence and simplify artifact path resolution

* fix(qa-lab): keep out-of-repo producer artifacts absolute to avoid ../ traversal refs

---------

Co-authored-by: Dallin Romney <dallinromney@gmail.com>
2026-06-17 15:09:25 -07:00
Vincent Koc
8288b4d4c9 fix(qa-lab): stabilize web search parity fixture 2026-06-18 00:07:36 +02:00
Dallin Romney
f7e5132ffd test: fold gateway smoke into qa e2e (#93178) 2026-06-17 14:55:28 -07:00
Dallin Romney
fae4a01d0d test: fold otel smoke into qa e2e (#93181)
* test: fold otel smoke into qa e2e

* test: eliminate otel smoke script
2026-06-17 14:54:58 -07:00
Dallin Romney
0a6736af09 test: fold lifecycle and package proof into QA Lab (#93114)
* test: fold script coverage into qa scenarios

* test: migrate script checks into qa e2e

* test: point qa code refs at migrated e2e

* test: fold plugin lifecycle probe into qa e2e

* test: use shared temp dirs in plugin lifecycle probe

* test: fold plugin lifecycle sweep into qa lab

* test: trim lifecycle docker text assertions

* test: keep followup script conversions split

* test: make lifecycle docker runner script-safe

* test: update changed helper routing expectation
2026-06-17 14:22:04 -07:00
Dallin Romney
450060d7a2 test(qa): expand smoke-ci and release categories and coverage (#93175)
* test(qa): add smoke ci primary coverage evidence

* test(qa): remove overstated primary coverage claims

* test(qa): make release profile include smoke ci

* test(qa): trim taxonomy formatting churn

* test(qa): avoid hardcoded profile names in coverage test

* test(qa): make release profile cover taxonomy

* test(qa): type profile fixture all category flag

* test(qa): include channel delivery in smoke ci profile
2026-06-15 18:05:52 -07:00
Dallin Romney
fef8394079 Convert QA scenarios to YAML files (#92915)
* refactor: load QA scenarios from YAML

* docs: update personal QA scenario docs

* test: keep QA scenarios YAML-only
2026-06-14 17:31:18 -07:00
NVIDIAN
ecaebfc51b fix(agents): retry thinking-only errored turns (#92191)
Retry replay-safe reasoning-only provider errors before assistant failover while preserving classified fallback and terminal-output ownership. Adds deterministic Anthropic gateway fault-injection coverage and focused regression tests.

Co-authored-by: ai-hpc <mail.speedy.hpc@hotmail.com>
2026-06-14 09:39:27 -07:00
Dallin Romney
a3e9dfee0e Simplify QA scorecard mapping shape (#92558)
* test(qa): simplify scorecard mapping shape

* test(qa): use typed scorecard evidence refs

* test(qa): map scorecard categories by coverage id

* feat: align qa coverage with taxonomy features

* refactor: keep qa coverage ids canonical

* refactor: minimize qa coverage id churn

* test: align qa coverage id assertions

* test: update qa evidence coverage expectations

* refactor qa taxonomy coverage ids

* style qa taxonomy coverage ids

* test qa coverage lint fix

* test qa coverage type fix
2026-06-14 00:16:33 -07:00
Dallin Romney
561b293c7a Run Vitest and Playwright scenarios from qa suite (#92606)
* test(qa): run vitest and playwright scenarios from qa suite

* fix(qa): harden scenario suite dispatch

* refactor(qa): share scenario path utilities

* refactor(qa): share test file scenario runner

* refactor(qa): route test file scenarios through suite runtime

* refactor(qa): use explicit suite runtime result kind

* test(qa): write suite evidence artifact

* refactor(qa): clarify suite execution dispatch

* fix(qa): keep test-file scenarios out of flow-only runners

* refactor(qa): export mixed scenario suite runner
2026-06-13 01:06:10 -07:00
Vincent Koc
7d3e8dc963 test(qa): restore memory fallback config safely 2026-06-10 18:03:15 +09:00
Vincent Koc
2c146261a2 fix(qa): accept completed mock image turns 2026-06-10 17:48:33 +09:00
Vincent Koc
87abb8defb fix(qa): preserve Matrix recovery state in sqlite 2026-06-10 17:35:41 +09:00
Vincent Koc
3b180d5d99 fix(qa): wait for restart wake before capability check 2026-06-10 17:09:18 +09:00
brokemac79
de4b8d8ebf feat(plugins): allow installed trusted policy contracts
Allow explicitly enabled installed plugins to register declared trusted tool policies and agent tool result middleware, with trusted policy ids scoped by plugin owner.

Verification covered targeted plugin/agent tests, typecheck, build, lint, local autoreview, and a Blacksmith Testbox runtime proof (tbx_01ktr1nq0rhq47fjkwrepm7fd3).
2026-06-10 16:18:23 +10:00
Vincent Koc
c350c35fad fix(release): allow QA capability restore patch
(cherry picked from commit db711701d2)
2026-06-10 08:27:59 +09:00
Vincent Koc
50130d32a9 test(release): align qa tool coverage gate 2026-06-09 01:02:24 +02:00
Vincent Koc
c7b01cf201 test(release): stabilize qa runtime parity gate 2026-06-09 01:02:24 +02:00
Peter Steinberger
b8d08f0cfd docs: document repository scripts 2026-06-04 20:52:50 -04:00
Vincent Koc
a9f099d279 test(qa): require channel scenario markers 2026-06-03 14:27:25 +02:00
Bek
bce3d5bf92 trace: Correlate channel diagnostics into one trace
Correlates channel receive, agent lifecycle, model attempt diagnostics, and outbound delivery diagnostics into one trace waterfall so channel message runs can be inspected end-to-end.

Maintainer follow-up removed the internal `AgentHarnessV2` adapter surface and kept the harness path canonical through `src/agents/harness/lifecycle.ts`.

Proof:
- PR checks passed on `04e9189c15480d53663d533a04c9883164b4dd54`.
- `node scripts/run-vitest.mjs src/agents/harness/lifecycle.test.ts src/agents/harness/selection.test.ts src/channels/turn/kernel.test.ts`
- `pnpm check:changed` Testbox `tbx_01kt3xtrm70qc7nb90cqv5rah1`

Thanks @bek91.

Co-authored-by: Bek <bek.akhmedov@gmail.com>
2026-06-02 06:38:00 -04:00
Vincent Koc
4550cfa6a7 fix(qa): run plugin MCP probes from repo root 2026-06-01 07:13:24 +02:00