options | pdf-parse

Parse Parameters

Name	Type	Attributes	Description
`partial`	`Array<number>`	optional	Array of 1-based page numbers to parse. When provided, only these pages will be parsed and returned in the same order as specified. Example: `[1, 3, 5]`. Parse only one page: `[7]`.
`first`	`number`	optional	If set to a positive integer N, parse the first N pages (pages 1..N). Ignored when `partial` is provided. If both `first` and `last` are set, they define an explicit inclusive page range and only pages from `first` to `last` will be parsed. In that case `first` is treated as the starting page number and the "first N" semantics is ignored.
`last`	`number`	optional	If set to a positive integer N, parse the last N pages (pages total-N+1..total). Ignored when `partial` is provided. If both `first` and `last` are set, they define an explicit inclusive page range and only pages from `first` to `last` will be parsed. In that case `last` is treated as the ending page number and the "last N" semantics is ignored.
`parsePageInfo`	`boolean`	optional	When `true`, `getInfo()` collects per-page metadata such as embedded links, title, page labels, and page dimensions. Support for ISBN, DOI, abstract, and references is a work in progress. Default: `false`.
`parseHyperlinks`	`boolean`	optional	When `true`, attempt to detect and include hyperlink annotations (e.g. URLs) associated with text. Detected links are formatted as Markdown inline links (for example: `[link text](https://example.com)`). Default: `false`.
`lineEnforce`	`boolean`	optional	When `true`, the extractor will try to enforce logical line breaks by inserting a newline between text items when the vertical distance between them exceeds `lineThreshold`. Useful to preserve paragraph/line structure when text items are emitted as separate segments by the PDF renderer. Default: `true`.
`lineThreshold`	`number`	optional	Threshold used to decide whether two nearby text items belong to different lines. A larger value makes the parser more likely to start a new line between items. Default: `4.6`.
`cellSeparator`	`string`	optional	String inserted between text items on the same line when a sufficiently large horizontal gap is detected (see `cellThreshold`). This is typically used to emulate a cell/column separator (for example, a tab). Example: `"\t"` to produce tab-separated cells. Default: `'\t'`.
`cellThreshold`	`number`	optional	Horizontal distance threshold used to decide when two text items on the same baseline should be considered separate cells (and thus separated by `cellSeparator`). A larger value produces fewer (wider) cells; a smaller value creates more cell breaks. Default: `7`.
`pageJoiner`	`string`	optional	Optional string appended at the end of each page's extracted text to mark page boundaries. The string supports the placeholders `page_number` and `total_number`, which are substituted with the current page number and total page count respectively. If omitted or empty, no page boundary marker is added. Default: `'\n-- page_number of total_number --'`.
`itemJoiner`	`string`	optional	Optional string used to join text items when returning a page's text. If provided, the extractor will use this value to join the sequence of text items instead of the default empty-string joining behavior. Use this to insert a custom separator between every text item. Default: `undefined`.
`imageThreshold`	`number`	optional	Minimum image dimension (in pixels) for width or height. Images whose width or height is less than or equal to this value are ignored by `getImage()`. Use to filter out very small decorative or tracking images. Default: `80`. Disable: `0`.
`scale`	`number`	optional	Screenshot scale factor used by `getScreenshot()`. Use `1` for the original size, `1.5` for a 50% larger image, etc. Default: `1`.
`desiredWidth`	`number`	optional	Desired screenshot width in pixels for `getScreenshot()`. When set, the `scale` option is ignored. Default: `undefined`.
`imageDataUrl`	`boolean`	optional	When `true`, include images and screenshots as base64 data URL strings. Applies to both `getImage()` and `getScreenshot()`. Default: `true`.
`imageBuffer`	`boolean`	optional	When `true`, include images and screenshots as binary buffers. Applies to both `getImage()` and `getScreenshot()`. Default: `true`.
`includeMarkedContent`	`boolean`	optional	When `true`, include marked content items in the items array of TextContent. Enables capturing the PDF's "marked content" tags (MCID, role/props) and structural/accessibility information — e.g. semantic tagging, sectioning, spans, alternate/alternative text, etc. Turn it on when you need structure/tag information or to map text ↔ structure using MCIDs (for example with `page.getStructTree()`). For plain text extraction it's usually left `false` (trade-off: larger output/increased detail). Default: `false`.
`disableNormalization`	`boolean`	optional	When `true`, the text is not normalized in the worker thread. Leave this as `false` (recommended for plain text) to have the worker normalize the text. Default: `false`.
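
The interaction between `partial`, `first`, and `last` can be easier to follow as code. The sketch below restates the rules from the table in a hypothetical helper, `resolvePages`, which is not part of pdf-parse. It assumes that page numbers outside the document are dropped, that an empty `partial` array counts as not provided, and that all pages are parsed when none of the three options is set; the table does not specify these cases.

```typescript
// Hypothetical helper: a sketch of the documented rules, not pdf-parse's own code.
interface PageSelection {
  partial?: number[];
  first?: number;
  last?: number;
}

function resolvePages(total: number, sel: PageSelection = {}): number[] {
  // Assumption: page numbers outside 1..total are silently dropped.
  const inDocument = (p: number): boolean =>
    Number.isInteger(p) && p >= 1 && p <= total;

  // Inclusive range, clamped to the document (assumption).
  const range = (from: number, to: number): number[] => {
    const pages: number[] = [];
    for (let p = Math.max(1, from); p <= Math.min(total, to); p++) {
      pages.push(p);
    }
    return pages;
  };

  const { partial, first, last } = sel;

  // `partial` takes precedence; the caller's order is preserved.
  if (partial && partial.length > 0) return partial.filter(inDocument);
  // Both set: explicit inclusive range first..last.
  if (first !== undefined && last !== undefined) return range(first, last);
  // Only `first`: the first N pages.
  if (first !== undefined) return range(1, first);
  // Only `last`: the last N pages.
  if (last !== undefined) return range(total - last + 1, total);
  // Assumption: nothing set means every page.
  return range(1, total);
}
```

For a 10-page document, `resolvePages(10, { first: 3 })` yields `[1, 2, 3]`, `resolvePages(10, { last: 3 })` yields `[8, 9, 10]`, and `resolvePages(10, { first: 3, last: 5 })` yields `[3, 4, 5]`.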

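The placeholder substitution described for `pageJoiner` can be illustrated with a small sketch. The helper `renderPageJoiner` is hypothetical and not part of pdf-parse. It assumes that omitting the option falls back to the documented default and that an empty string disables the marker.

```typescript
// Documented default value of `pageJoiner`.
const DEFAULT_PAGE_JOINER = "\n-- page_number of total_number --";

// Hypothetical helper: substitutes the two documented placeholders.
function renderPageJoiner(
  page: number,
  total: number,
  template: string = DEFAULT_PAGE_JOINER,
): string {
  // Assumption: an empty template means "no page boundary marker".
  if (template === "") return "";
  return template
    .replace(/page_number/g, String(page))
    .replace(/total_number/g, String(total));
}

// Appending the marker to one page's extracted text.
const pageText = "Hello from page two.";
const withMarker = pageText + renderPageJoiner(2, 10);
```

Here `withMarker` is the page text followed by a newline and `-- 2 of 10 --`. A custom template such as `"[page_number/total_number]"` would render as `[2/10]` for the same page.
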
Load Parameters

Name	Type	Attributes	Description
`url`	`string \| URL`	optional	The URL of the PDF.
`data`	`TypedArray \| ArrayBuffer \| Array<number> \| string`	optional	Binary PDF data. Use TypedArrays (e.g., `Uint8Array`) to improve memory usage. If the PDF data is Base64-encoded, use `atob()` to convert it to a binary string first. NOTE: If TypedArrays are used, they will generally be transferred to the worker thread, which reduces main-thread memory usage but means the worker takes ownership of the array.
`httpHeaders`	`Object`	optional	Basic authentication headers.
`withCredentials`	`boolean`	optional	Indicates whether cross-site Access-Control requests should be made using credentials (e.g., cookies or auth headers). Default: `false`.
`password`	`string`	optional	For decrypting password-protected PDFs.
`length`	`number`	optional	The PDF file length. Used for progress reports and range requests.
`range`	`PDFDataRangeTransport`	optional	Allows using a custom range transport implementation.
`rangeChunkSize`	`number`	optional	Maximum number of bytes fetched per range request. Default: `65536` (`2^16`).
`worker`	`PDFWorker`	optional	The worker used for loading and parsing PDF data.
`verbosity`	`number`	optional	Controls logging level; use constants from `VerbosityLevel`.
`docBaseUrl`	`string`	optional	Base URL of the document, used to resolve relative URLs in annotations and outline items.
`cMapUrl`	`string`	optional	URL where predefined Adobe CMaps are located. Include trailing slash.
`cMapPacked`	`boolean`	optional	Specifies if Adobe CMaps are binary-packed. Default: `true`.
`CMapReaderFactory`	`Object`	optional	Factory for reading built-in CMap files. Default: `{DOMCMapReaderFactory}`.
`iccUrl`	`string`	optional	URL where predefined ICC profiles are located. Include trailing slash.
`useSystemFonts`	`boolean`	optional	If `true`, non-embedded fonts fall back to system fonts. Default: `true` in browsers, `false` in Node.js (unless `disableFontFace === true`, then always `false`).
`standardFontDataUrl`	`string`	optional	URL for standard font files. Include trailing slash.
`StandardFontDataFactory`	`Object`	optional	Factory for reading standard font files. Default: `{DOMStandardFontDataFactory}`.
`wasmUrl`	`string`	optional	URL for WebAssembly files. Include trailing slash.
`WasmFactory`	`Object`	optional	Factory for reading WASM files. Default: `{DOMWasmFactory}`.
`useWorkerFetch`	`boolean`	optional	Enable `fetch()` in worker thread for CMap/font/WASM files. If `true`, factory options are ignored. Default: `true` in browsers, `false` in Node.js.
`useWasm`	`boolean`	optional	Attempt to use WebAssembly for better performance (e.g., image decoding). Default: `true`.
`stopAtErrors`	`boolean`	optional	Reject promises (e.g., `getTextContent`) on parse errors instead of recovering partially. Default: `false`.
`maxImageSize`	`number`	optional	Maximum image size in total pixels (`width * height`). Use `-1` for no limit. Default: `-1`.
`isEvalSupported`	`boolean`	optional	Whether evaluating strings as JS is allowed (for PDF function performance). Default: `true`.
`isOffscreenCanvasSupported`	`boolean`	optional	Whether `OffscreenCanvas` can be used in worker. Default: `true` in browsers, `false` in Node.js.
`isImageDecoderSupported`	`boolean`	optional	Whether `ImageDecoder` can be used in worker. Default: `true` in browsers, `false` in Node.js. NOTE: Temporarily disabled in Chromium due to two bugs: crashes with the BMP decoder on huge images (issue 374807001), and broken JPEGs with custom color profiles (issue 378869810).
`canvasMaxAreaInBytes`	`number`	optional	Used to determine when to resize images (via `OffscreenCanvas`). Use `-1` to use a slower fallback algorithm.
`disableFontFace`	`boolean`	optional	Disable `@font-face`/Font Loading API; use built-in glyph renderer instead. Default: `false` in browsers, `true` in Node.js.
`fontExtraProperties`	`boolean`	optional	Include extra (non-rendering) font properties when exporting font data from worker. Increases memory usage. Default: `false`.
`enableXfa`	`boolean`	optional	Render XFA forms if present. Default: `false`.
`ownerDocument`	`HTMLDocument`	optional	Explicit document context for creating elements and loading resources. Defaults to current document.
`disableRange`	`boolean`	optional	Disable range requests for PDF loading. Default: `false`.
`disableStream`	`boolean`	optional	Disable streaming PDF data. Default: `false`.
`disableAutoFetch`	`boolean`	optional	Disable pre-fetching of PDF data. Requires `disableStream: true` to work fully. Default: `false`.
`pdfBug`	`boolean`	optional	Enable debugging hooks (see `web/debugger.js`). Default: `false`.
`CanvasFactory`	`Object`	optional	Factory for creating canvases. Default: `{DOMCanvasFactory}`.
`FilterFactory`	`Object`	optional	Factory for creating SVG filters during rendering. Default: `{DOMFilterFactory}`.
`enableHWA`	`boolean`	optional	Enable hardware acceleration for rendering. Default: `false`.
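
The `data` row recommends typed arrays and mentions `atob()` for Base64 input. The sketch below combines the two by decoding a Base64 string into a `Uint8Array`. The helper name `base64ToBytes` is hypothetical, and the sample string only encodes the four-byte `%PDF` header, not a complete document. The global `atob()` is assumed to be available (browsers and recent Node.js versions).

```typescript
// Hypothetical helper: decode Base64 into bytes suitable for the `data` option.
function base64ToBytes(base64: string): Uint8Array {
  // atob() returns a "binary string": one character per byte.
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// "JVBERg==" is Base64 for the ASCII bytes "%PDF" (illustration only).
const pdfBytes = base64ToBytes("JVBERg==");

// An options object using only a key listed in the table above.
// How this object is handed to the library is not shown here.
const loadOptions = { data: pdfBytes };
```

Because a typed array may be transferred to the worker thread, it is safer to copy it first (for example with `pdfBytes.slice()`) if the same bytes are needed again on the main thread.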
