A command-line interface for extracting data from PDF files using the pdf-parse library.
The CLI ships with the pdf-parse package. Installing the package globally makes the pdf-parse command available:
npm install -g pdf-parse
To update to the latest version:
npm update -g pdf-parse
To remove the CLI tool:
npm uninstall -g pdf-parse
pdf-parse <command> <file> [options]
Here, <file> is a local PDF file path or, for commands that support it, a URL.
Check PDF file headers and validate format. Only works with URLs.
pdf-parse check https://example.com/document.pdf
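Since check only accepts URLs, a rough local counterpart to its magic-byte validation can be sketched in plain shell. This is not part of the CLI: has_pdf_magic is a hypothetical helper, and it assumes that "magic bytes" refers to the standard "%PDF-" signature at the start of a PDF file.

```shell
# Hypothetical helper, not provided by pdf-parse.
# Succeeds if the file starts with the PDF signature "%PDF-".
has_pdf_magic() {
  # Read only the first five bytes; a missing or unreadable file fails the test.
  [ "$(head -c 5 "$1" 2>/dev/null)" = "%PDF-" ]
}

# Example usage (uncomment to try on a local file):
# if has_pdf_magic document.pdf; then
#   echo "looks like a PDF"
# else
#   echo "not a PDF" >&2
# fi
```

A matching signature only shows that the file claims to be a PDF. It says nothing about whether the rest of the file is well formed.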
Extract PDF metadata and information.
pdf-parse info document.pdf
Extract text content from PDF pages.
pdf-parse text document.pdf --pages 1-3
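The --pages value can mix single pages and ranges, for example 1,3-5,7. To make the notation concrete, the sketch below expands such a value into individual page numbers. It only illustrates the syntax: expand_pages is a hypothetical helper, not the CLI's own parser, and it assumes well-formed input made of digits, commas, and hyphens.

```shell
# Hypothetical helper illustrating the page-range notation.
# Prints the expanded page numbers separated by spaces.
expand_pages() {
  spec=$1
  out=""
  # Split the spec on commas into the positional parameters.
  old_ifs=$IFS
  IFS=,
  set -- $spec
  IFS=$old_ifs
  for part in "$@"; do
    case $part in
      *-*)
        # A range such as 3-5: take the text before and after the hyphen.
        start=${part%-*}
        end=${part#*-}
        i=$start
        while [ "$i" -le "$end" ]; do
          out="$out $i"
          i=$((i + 1))
        done
        ;;
      *)
        # A single page number.
        out="$out $part"
        ;;
    esac
  done
  # Drop the leading space added by the first append.
  echo "${out# }"
}

expand_pages "1,3-5,7"   # prints: 1 3 4 5 7
```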
Extract embedded images from PDF pages.
pdf-parse image document.pdf --output ./images/
Generate screenshots of PDF pages.
pdf-parse screenshot document.pdf --output ./screenshots/ --scale 2.0
Extract tabular data from PDF pages.
pdf-parse table document.pdf --format json
-o, --output <file>: Output file path (for single file) or directory (for multiple files)
-p, --pages <range>: Page range (e.g., 1,3-5,7)
-f, --format <format>: Output format (json, text, dataurl)
-m, --min <px>: Minimum image size threshold in pixels (default: 80)
-s, --scale <factor>: Scale factor for screenshots (default: 1.0)
-w, --width <px>: Desired width for screenshots in pixels
-l, --large: Enable optimizations for large PDF files
--magic: Validate PDF magic bytes
-h, --help: Show help message
-v, --version: Show version number
Get PDF information:
pdf-parse info mydocument.pdf
Extract text from specific pages:
pdf-parse text mydocument.pdf --pages 1,3-5
Extract all images to a directory:
pdf-parse image mydocument.pdf --output ./extracted-images/
Extract images with minimum size filter:
pdf-parse image mydocument.pdf --min 100 --output ./images/
Generate screenshots with custom scale:
pdf-parse screenshot mydocument.pdf --scale 1.5 --output ./screenshots/
Generate screenshots with specific width:
pdf-parse screenshot mydocument.pdf --width 800 --output ./screenshots/
Extract tables in JSON format:
pdf-parse table mydocument.pdf --format json --output tables.json
Extract tables from specific pages:
pdf-parse table mydocument.pdf --pages 2-4
Check PDF headers from URL:
pdf-parse check https://example.com/document.pdf
Check without magic byte validation:
pdf-parse check https://example.com/document.pdf --no-magic
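The single-file examples above can be combined into a batch job. The sketch below runs the table command with --format json and --output over every PDF in a directory. extract_all_tables and the PDF_PARSE override variable are hypothetical names introduced for illustration. The function assumes pdf-parse is on the PATH and does nothing when no PDF files are present.

```shell
# Hypothetical helper, not provided by pdf-parse.
# Usage: extract_all_tables <source-dir> <output-dir>
extract_all_tables() {
  src_dir=$1
  out_dir=$2
  mkdir -p "$out_dir" || return 1
  for pdf in "$src_dir"/*.pdf; do
    # If nothing matches, the glob stays literal; skip it.
    [ -e "$pdf" ] || continue
    name=$(basename "$pdf" .pdf)
    # PDF_PARSE lets a different binary be substituted (an assumption made
    # here for testing); it defaults to the pdf-parse command.
    "${PDF_PARSE:-pdf-parse}" table "$pdf" --format json --output "$out_dir/$name.json"
  done
}

# Example usage (uncomment to run):
# extract_all_tables ./reports ./tables
```

Setting PDF_PARSE=echo turns the call into a dry run that prints each command line instead of executing it.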
For large PDF files (> 5MB), use the --large flag to enable performance optimizations:
pdf-parse text https://example.com/large-document.pdf --large --pages 1-10
pdf-parse info https://example.com/huge-report.pdf --large
The --large flag enables:
Human-readable text output for most commands.
Structured data output using --format json.
Base64 encoded data URLs for image and screenshot commands using --format dataurl.
Specify page ranges using comma-separated values and ranges:
1: Page 1
1,3,5: Pages 1, 3, and 5
1-5: Pages 1 through 5
1,3-5,7: Pages 1, 3, 4, 5, and 7
The CLI tool provides clear error messages for common issues: