| Name | Type | Attributes | Description |
|---|---|---|---|
| partial | Array<number> | optional | Array of 1-based page numbers to parse. When provided, only these pages are parsed, and they are returned in the order given. Example: [1, 3, 5]. To parse a single page: [7]. |
| first | number | optional | If set to a positive integer N, parse the first N pages (pages 1..N). Ignored when partial is provided. If both first and last are set, they instead define an explicit inclusive page range: first is the starting page number, last is the ending page number, and the "first N" semantics do not apply. |
| last | number | optional | If set to a positive integer N, parse the last N pages (pages total-N+1..total). Ignored when partial is provided. If both first and last are set, they instead define an explicit inclusive page range: first is the starting page number, last is the ending page number, and the "last N" semantics do not apply. |
| parsePageInfo | boolean | optional | When true, collect per-page metadata such as embedded links, title, page labels, and page dimensions. When getInfo() is used, support for ISBN, DOI, abstract, and references is a work in progress. Default: false. |
| parseHyperlinks | boolean | optional | When true, attempt to detect and include hyperlink annotations (e.g. URLs) associated with text. Detected links are formatted as Markdown inline links (for example: [link text](https://example.com)). Default: false. |
| lineEnforce | boolean | optional | When true, the extractor tries to enforce logical line breaks by inserting a newline between text items when the vertical distance between them exceeds lineThreshold. Useful for preserving paragraph and line structure when the PDF renderer emits text items as separate segments. Default: true. |
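| lineThreshold | number | optional | Vertical-distance threshold used to decide whether two nearby text items belong to different lines. Because a newline is inserted only when the distance exceeds this value (see lineEnforce), a larger value makes the parser less likely to start a new line between items. Default: 4.6. |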
| cellSeparator | string | optional | String inserted between text items on the same line when a sufficiently large horizontal gap is detected (see cellThreshold). Typically used to emulate a cell or column separator, for example '\t' to produce tab-separated cells. Default: '\t'. |
| cellThreshold | number | optional | Horizontal distance threshold used to decide when two text items on the same baseline should be treated as separate cells (and thus separated by cellSeparator). A larger value produces fewer (wider) cells; a smaller value creates more cell breaks. Default: 7. |
| pageJoiner | string | optional | Optional string appended at the end of each page's extracted text to mark page boundaries. The string supports the placeholders page_number and total_number, which are replaced with the current page number and the total page count respectively. If omitted or empty, no page boundary marker is added. Default: '\n-- page_number of total_number --'. |
| itemJoiner | string | optional | Optional string used to join text items when returning a page's text. If provided, it is inserted between every pair of text items instead of the default empty-string join. Default: undefined. |
| imageThreshold | number | optional | Minimum image dimension (in pixels) for width or height. Images whose width or height is less than or equal to this value are ignored by getImage(). Use it to filter out very small decorative or tracking images. Default: 80. Set to 0 to disable. |
| scale | number | optional | Screenshot scale factor used by getScreenshot(). Use 1 for the original size, 1.5 for a 50% larger image, etc. Default: 1. |
| desiredWidth | number | optional | Desired screenshot width in pixels for getScreenshot(). When set, the scale option is ignored. Default: undefined. |
| imageDataUrl | boolean | optional | When true, include images and screenshots as base64 data URL strings. Applies to both getImage() and getScreenshot(). Default: true. |
| imageBuffer | boolean | optional | When true, include images and screenshots as binary buffers. Applies to both getImage() and getScreenshot(). Default: true. |
| includeMarkedContent | boolean | optional | When true, include marked content items in the items array of TextContent. This captures the PDF's "marked content" tags (MCID, role/props) and structural or accessibility information, e.g. semantic tagging, sectioning, spans, and alternative text. Turn it on when you need structure or tag information, or to map text to structure using MCIDs (for example with page.getStructTree()). For plain text extraction it is usually left false (trade-off: larger, more detailed output). Default: false. |
| disableNormalization | boolean | optional | When true, the text is not normalized in the worker thread. Leaving this false, so that normalization happens in the worker, is recommended for plain text extraction. Default: false. |

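
The precedence between partial, first, and last described in the table above can be easier to follow as code. The following is a minimal TypeScript sketch of those rules only; `resolvePages` and `PageSelection` are hypothetical names and not part of the library. Three behaviours are assumptions rather than documented facts: out-of-range page numbers are dropped or clamped, an empty partial array counts as "not provided", and all pages are selected when no option is set.

```typescript
// Hypothetical shape holding just the three page-selection options.
interface PageSelection {
  partial?: number[];
  first?: number;
  last?: number;
}

// Returns the 1-based page numbers that would be parsed for a document
// with `total` pages, following the precedence rules in the table.
function resolvePages(total: number, opts: PageSelection = {}): number[] {
  // Assumption: page numbers outside 1..total are silently dropped.
  const valid = (p: number): boolean =>
    Number.isInteger(p) && p >= 1 && p <= total;

  // partial takes precedence over first/last and keeps the caller's order.
  // Assumption: an empty array is treated as "not provided".
  if (opts.partial && opts.partial.length > 0) {
    return opts.partial.filter(valid);
  }

  // Only positive integers count as "set".
  const asPositive = (n: number | undefined): number | undefined =>
    n !== undefined && Number.isInteger(n) && n > 0 ? n : undefined;
  const first = asPositive(opts.first);
  const last = asPositive(opts.last);

  // Assumption: with no option set, every page is selected.
  let from = 1;
  let to = total;
  if (first !== undefined && last !== undefined) {
    // Both set: explicit inclusive range first..last.
    from = first;
    to = last;
  } else if (first !== undefined) {
    // Only first: pages 1..N.
    to = first;
  } else if (last !== undefined) {
    // Only last: pages total-N+1..total.
    from = total - last + 1;
  }

  // Assumption: the range is clamped to the document bounds.
  from = Math.max(1, from);
  to = Math.min(total, to);

  const pages: number[] = [];
  for (let p = from; p <= to; p++) {
    pages.push(p);
  }
  return pages;
}
```

For a 10-page document, `resolvePages(10, { first: 3, last: 5 })` selects pages 3 to 5, `resolvePages(10, { last: 2 })` selects pages 9 and 10, and `resolvePages(10, { partial: [7], first: 2 })` selects only page 7 because partial takes precedence.
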
| Name | Type | Attributes | Description |
|---|---|---|---|
| url | string \| URL | optional | The URL of the PDF. |
| data | TypedArray \| ArrayBuffer \| Array<number> \| string | optional | Binary PDF data. Use TypedArrays (e.g., Uint8Array) to improve memory usage. If PDF data is BASE64-encoded, use atob() to convert it to a binary string first. NOTE: If TypedArrays are used, they will generally be transferred to the worker thread, reducing main-thread memory usage but taking ownership of the array. |
| httpHeaders | Object | optional | Basic authentication headers. |
| withCredentials | boolean | optional | Indicates whether cross-site Access-Control requests should be made using credentials (e.g., cookies or auth headers). Default: false. |
| password | string | optional | For decrypting password-protected PDFs. |
| length | number | optional | The PDF file length. Used for progress reports and range requests. |
| range | PDFDataRangeTransport | optional | Allows using a custom range transport implementation. |
| rangeChunkSize | number | optional | Maximum number of bytes fetched per range request. Default: 65536 (2^16). |
| worker | PDFWorker | optional | The worker used for loading and parsing PDF data. |
| verbosity | number | optional | Controls logging level; use constants from VerbosityLevel. |
| docBaseUrl | string | optional | Base URL of the document, used to resolve relative URLs in annotations and outline items. |
| cMapUrl | string | optional | URL where predefined Adobe CMaps are located. Include trailing slash. |
| cMapPacked | boolean | optional | Specifies if Adobe CMaps are binary-packed. Default: true. |
| CMapReaderFactory | Object | optional | Factory for reading built-in CMap files. Default: {DOMCMapReaderFactory}. |
| iccUrl | string | optional | URL where predefined ICC profiles are located. Include trailing slash. |
| useSystemFonts | boolean | optional | If true, non-embedded fonts fall back to system fonts. Default: true in browsers, false in Node.js (unless disableFontFace === true, then always false). |
| standardFontDataUrl | string | optional | URL for standard font files. Include trailing slash. |
| StandardFontDataFactory | Object | optional | Factory for reading standard font files. Default: {DOMStandardFontDataFactory}. |
| wasmUrl | string | optional | URL for WebAssembly files. Include trailing slash. |
| WasmFactory | Object | optional | Factory for reading WASM files. Default: {DOMWasmFactory}. |
| useWorkerFetch | boolean | optional | Enable fetch() in worker thread for CMap/font/WASM files. If true, factory options are ignored. Default: true in browsers, false in Node.js. |
| useWasm | boolean | optional | Attempt to use WebAssembly for better performance (e.g., image decoding). Default: true. |
| stopAtErrors | boolean | optional | Reject promises (e.g., getTextContent) on parse errors instead of recovering partially. Default: false. |
| maxImageSize | number | optional | Max image size in total pixels (width * height). Use -1 for no limit (default). |
| isEvalSupported | boolean | optional | Whether evaluating strings as JS is allowed (for PDF function performance). Default: true. |
| isOffscreenCanvasSupported | boolean | optional | Whether OffscreenCanvas can be used in worker. Default: true in browsers, false in Node.js. |
| isImageDecoderSupported | boolean | optional | Whether ImageDecoder can be used in worker. Default: true in browsers, false in Node.js. NOTE: Temporarily disabled in Chromium due to two bugs: crashes with the BMP decoder on huge images (issue 374807001), and broken JPEGs with custom color profiles (issue 378869810). |
| canvasMaxAreaInBytes | number | optional | Used to determine when to resize images (via OffscreenCanvas). Use -1 to use a slower fallback algorithm. |
| disableFontFace | boolean | optional | Disable @font-face/Font Loading API; use built-in glyph renderer instead. Default: false in browsers, true in Node.js. |
| fontExtraProperties | boolean | optional | Include extra (non-rendering) font properties when exporting font data from worker. Increases memory usage. Default: false. |
| enableXfa | boolean | optional | Render XFA forms if present. Default: false. |
| ownerDocument | HTMLDocument | optional | Explicit document context for creating elements and loading resources. Defaults to current document. |
| disableRange | boolean | optional | Disable range requests for PDF loading. Default: false. |
| disableStream | boolean | optional | Disable streaming PDF data. Default: false. |
| disableAutoFetch | boolean | optional | Disable pre-fetching of PDF data. Requires disableStream: true to work fully. Default: false. |
| pdfBug | boolean | optional | Enable debugging hooks (see web/debugger.js). Default: false. |
| CanvasFactory | Object | optional | Factory for creating canvases. Default: {DOMCanvasFactory}. |
| FilterFactory | Object | optional | Factory for creating SVG filters during rendering. Default: {DOMFilterFactory}. |
| enableHWA | boolean | optional | Enable hardware acceleration for rendering. Default: false. |
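
Several defaults in the table above depend on whether the code runs in a browser or in Node.js. The following TypeScript sketch gathers those environment-dependent defaults in one place so they can be compared side by side. `envDefaults` and `EnvDefaults` are hypothetical names; the function only restates the documented defaults and does not call the library. It also does not model the temporary Chromium exception noted for isImageDecoderSupported.

```typescript
// Hypothetical shape covering only the options whose defaults differ
// between browsers and Node.js according to the table.
interface EnvDefaults {
  disableFontFace: boolean;
  useSystemFonts: boolean;
  useWorkerFetch: boolean;
  isOffscreenCanvasSupported: boolean;
  isImageDecoderSupported: boolean;
}

// Returns the documented defaults for the given environment, letting
// explicit values in `overrides` win over the defaults.
function envDefaults(
  isNode: boolean,
  overrides: Partial<EnvDefaults> = {}
): EnvDefaults {
  // disableFontFace: false in browsers, true in Node.js.
  const disableFontFace = overrides.disableFontFace ?? isNode;

  return {
    disableFontFace,
    // useSystemFonts: true in browsers, false in Node.js, and always
    // false by default when disableFontFace is true.
    useSystemFonts: overrides.useSystemFonts ?? (!isNode && !disableFontFace),
    // The remaining three: true in browsers, false in Node.js.
    useWorkerFetch: overrides.useWorkerFetch ?? !isNode,
    isOffscreenCanvasSupported: overrides.isOffscreenCanvasSupported ?? !isNode,
    isImageDecoderSupported: overrides.isImageDecoderSupported ?? !isNode,
  };
}
```

Under these assumptions, `envDefaults(true)` describes a Node.js run in which disableFontFace is true and the other four options are false, while `envDefaults(false, { disableFontFace: true })` shows how turning off font-face loading in a browser also switches the useSystemFonts default to false.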