refactor(pdf): move document extraction to plugin

* refactor(pdf): move document extraction to plugin

* fix(deps): sync document extract lockfile

* fix(pdf): harden document extraction plugin
Vincent Koc
2026-04-24 17:15:05 -07:00
committed by GitHub
parent 915931aa38
commit e3cba98f39
34 changed files with 1023 additions and 321 deletions

@@ -172,8 +172,9 @@ Current behavior:
 rasterized into images and passed to the model, and the injected file block uses
 the placeholder `[PDF content rendered to images]`.
-PDF parsing uses the Node-friendly `pdfjs-dist` legacy build (no worker). The modern
-PDF.js build expects browser workers/DOM globals, so it is not used in the Gateway.
+PDF parsing is provided by the bundled `document-extract` plugin, which uses the
+Node-friendly `pdfjs-dist` legacy build (no worker). The modern PDF.js build
+expects browser workers/DOM globals, so it is not used in the Gateway.
 URL fetch defaults:

@@ -112,7 +112,9 @@ Fallback details:
 - If text extraction succeeds but image extraction would require vision on a
 text-only model, OpenClaw drops the rendered images and continues with the
 extracted text.
-- Extraction fallback requires `pdfjs-dist` (and `@napi-rs/canvas` for image rendering).
+- Extraction fallback uses the bundled `document-extract` plugin. The plugin owns
+`pdfjs-dist`; `@napi-rs/canvas` is used only when image rendering fallback is
+available.
 ## Config
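
Reviewer note: the fallback rule in the second hunk (drop rendered images when the target model is text-only and extracted text is available) could be sketched roughly as below. This is only an illustration under stated assumptions: `ExtractedPdf`, `ModelCapabilities`, and `selectPdfPayload` are hypothetical names, not identifiers from OpenClaw or the `document-extract` plugin, and the real plugin may structure this differently.

```typescript
// Hypothetical shapes; none of these names come from the OpenClaw codebase.
interface ExtractedPdf {
  text: string;
  images: Uint8Array[]; // rendered page images; empty if rendering was unavailable
}

interface ModelCapabilities {
  supportsVision: boolean;
}

// Decide what to forward to the model after extraction fallback.
function selectPdfPayload(
  pdf: ExtractedPdf,
  model: ModelCapabilities,
): ExtractedPdf {
  const hasText = pdf.text.trim().length > 0;
  if (hasText && !model.supportsVision) {
    // Text-only model with usable text: drop the images, keep the text.
    return { text: pdf.text, images: [] };
  }
  // Otherwise pass the extraction result through unchanged.
  return pdf;
}
```

The sketch deliberately leaves the "no text and no vision" case untouched, since the hunk does not say how that case is handled.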