refactor(pdf): move document extraction to plugin

* refactor(pdf): move document extraction to plugin
* fix(deps): sync document extract lockfile
* fix(pdf): harden document extraction plugin
2026-05-06 15:10:52 +00:00 · 2026-04-24 17:15:05 -07:00
parent 915931aa38
commit e3cba98f39
34 changed files with 1023 additions and 321 deletions
--- a/docs/gateway/openresponses-http-api.md
+++ b/docs/gateway/openresponses-http-api.md
@@ -172,8 +172,9 @@ Current behavior:
  rasterized into images and passed to the model, and the injected file block uses
  the placeholder `[PDF content rendered to images]`.

-PDF parsing uses the Node-friendly `pdfjs-dist` legacy build (no worker). The modern
-PDF.js build expects browser workers/DOM globals, so it is not used in the Gateway.
+PDF parsing is provided by the bundled `document-extract` plugin, which uses the
+Node-friendly `pdfjs-dist` legacy build (no worker). The modern PDF.js build
+expects browser workers/DOM globals, so it is not used in the Gateway.

 URL fetch defaults:

--- a/docs/tools/pdf.md
+++ b/docs/tools/pdf.md
@@ -112,7 +112,9 @@ Fallback details:
 - If text extraction succeeds but image extraction would require vision on a
  text-only model, OpenClaw drops the rendered images and continues with the
  extracted text.
-- Extraction fallback requires `pdfjs-dist` (and `@napi-rs/canvas` for image rendering).
+- Extraction fallback uses the bundled `document-extract` plugin. The plugin owns
+  `pdfjs-dist`; `@napi-rs/canvas` is used only when image rendering fallback is
+  available.

 ## Config