    PDF-Parse CLI Tool

    A command-line interface for extracting data from PDF files using the pdf-parse library.

    The CLI tool ships with the pdf-parse package. Once the package is installed, the CLI is available as the pdf-parse command. To install it globally:

    npm install -g pdf-parse
    

    To update to the latest version:

    npm update -g pdf-parse
    

    To remove the CLI tool:

    npm uninstall -g pdf-parse
    
    Basic usage:

    pdf-parse <command> <file> [options]
    

    Here, <file> is a local PDF file path or, for certain commands, a URL.

    The check command inspects PDF file headers and validates the file format. It works only with URLs.

    pdf-parse check https://example.com/document.pdf
    

    The info command extracts PDF metadata and document information.

    pdf-parse info document.pdf
    

    The text command extracts text content from PDF pages.

    pdf-parse text document.pdf --pages 1-3
    

    The image command extracts embedded images from PDF pages.

    pdf-parse image document.pdf --output ./images/
    

    The screenshot command generates screenshots of PDF pages.

    pdf-parse screenshot document.pdf --output ./screenshots/ --scale 2.0
    

    The table command extracts tabular data from PDF pages.

    pdf-parse table document.pdf --format json
    
    Available options:

    • -o, --output <file>: Output file path (for single file) or directory (for multiple files)
    • -p, --pages <range>: Page range (e.g., 1,3-5,7)
    • -f, --format <format>: Output format (json, text, dataurl)
    • -m, --min <px>: Minimum image size threshold in pixels (default: 80)
    • -s, --scale <factor>: Scale factor for screenshots (default: 1.0)
    • -w, --width <px>: Desired width for screenshots in pixels
    • -l, --large: Enable optimizations for large PDF files
    • --magic: Validate PDF magic bytes (pass --no-magic to skip this validation)
    • -h, --help: Show help message
    • -v, --version: Show version number

    Get PDF information:

    pdf-parse info mydocument.pdf
    

    Extract text from specific pages:

    pdf-parse text mydocument.pdf --pages 1,3-5
    

    Extract all images to a directory:

    pdf-parse image mydocument.pdf --output ./extracted-images/
    

    Extract images with minimum size filter:

    pdf-parse image mydocument.pdf --min 100 --output ./images/
    

    Generate screenshots with custom scale:

    pdf-parse screenshot mydocument.pdf --scale 1.5 --output ./screenshots/
    

    Generate screenshots with specific width:

    pdf-parse screenshot mydocument.pdf --width 800 --output ./screenshots/
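
    The --scale and --width options can be related with a little arithmetic. The sketch below is illustrative only: it assumes, without confirmation from this page, that a page renders at one pixel per PDF point at scale 1.0 and that --width preserves the aspect ratio. The Size type and both functions are hypothetical helpers, not part of pdf-parse.

```typescript
// Hypothetical type: page dimensions, in PDF points or output pixels.
interface Size {
  width: number;
  height: number;
}

// Assumption: at scale 1.0 one PDF point maps to one output pixel.
// Returns the pixel size of a page rendered at the given scale factor.
function scaledSize(page: Size, scale: number): Size {
  return {
    width: Math.round(page.width * scale),
    height: Math.round(page.height * scale),
  };
}

// Assumption: a requested width keeps the aspect ratio, so it is
// equivalent to a scale factor of targetWidth / page width.
function scaleForWidth(page: Size, targetWidth: number): number {
  return targetWidth / page.width;
}

// US Letter page: 612 by 792 points.
const letter: Size = { width: 612, height: 792 };
const atScale = scaledSize(letter, 1.5); // { width: 918, height: 1188 }
const atWidth = scaledSize(letter, scaleForWidth(letter, 800)); // { width: 800, height: 1035 }
```

    Under these assumptions, a US Letter page at --scale 1.5 comes out at 918 by 1188 pixels, and --width 800 corresponds to a scale of roughly 1.31.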
    

    Extract tables in JSON format:

    pdf-parse table mydocument.pdf --format json --output tables.json
    

    Extract tables from specific pages:

    pdf-parse table mydocument.pdf --pages 2-4
    

    Check PDF headers from URL:

    pdf-parse check https://example.com/document.pdf
    

    Check without magic byte validation:

    pdf-parse check https://example.com/document.pdf --no-magic
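
    For context, a PDF file starts with the five bytes "%PDF-", followed by a version number. The sketch below shows what a magic-byte test of that kind can look like; it is a hypothetical helper for illustration, and this page does not describe how the CLI itself performs the validation.

```typescript
// The five bytes "%PDF-" that open a PDF file header.
const PDF_MAGIC: number[] = [0x25, 0x50, 0x44, 0x46, 0x2d];

// Hypothetical helper (not the CLI's own code): returns true when the
// given bytes start with the PDF magic bytes.
function hasPdfMagic(bytes: Uint8Array): boolean {
  if (bytes.length < PDF_MAGIC.length) {
    return false;
  }
  return PDF_MAGIC.every((value, index) => bytes[index] === value);
}

// "%PDF-1.7" as raw bytes.
const pdfHeader = new Uint8Array([0x25, 0x50, 0x44, 0x46, 0x2d, 0x31, 0x2e, 0x37]);
// "<html" as raw bytes, e.g. an error page served instead of a PDF.
const htmlHeader = new Uint8Array([0x3c, 0x68, 0x74, 0x6d, 0x6c]);

const pdfLooksValid = hasPdfMagic(pdfHeader); // true
const htmlLooksValid = hasPdfMagic(htmlHeader); // false
```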
    

    For large PDF files (> 5MB), use the --large flag to enable performance optimizations:

    pdf-parse text https://example.com/large-document.pdf --large --pages 1-10
    pdf-parse info https://example.com/huge-report.pdf --large

    With the --large flag, the CLI:

    • Disables auto-fetching of additional pages
    • Uses chunk-based loading instead of streaming
    • Uses an optimized chunk size for range requests
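    To make the idea of range request chunks concrete, the sketch below splits a file of known size into HTTP Range header values. It is a generic illustration: the function name is hypothetical, the chunk size is an arbitrary example, and this page does not state the chunk size or request strategy the CLI actually uses.

```typescript
// Hypothetical helper: lists the HTTP Range header values needed to
// fetch totalBytes in chunks of at most chunkBytes each.
// Range end offsets are inclusive, as in "bytes=0-3" for four bytes.
function rangeHeaders(totalBytes: number, chunkBytes: number): string[] {
  if (!(totalBytes >= 0) || !(chunkBytes > 0)) {
    throw new Error("totalBytes must be >= 0 and chunkBytes must be > 0");
  }
  const headers: string[] = [];
  for (let start = 0; start < totalBytes; start += chunkBytes) {
    const end = Math.min(start + chunkBytes, totalBytes) - 1;
    headers.push(`bytes=${start}-${end}`);
  }
  return headers;
}

// A 10-byte file fetched in 4-byte chunks:
const ranges = rangeHeaders(10, 4); // ["bytes=0-3", "bytes=4-7", "bytes=8-9"]
```

    Larger chunks mean fewer requests per file, while smaller chunks avoid downloading data for pages that are never read.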

    The CLI supports three output formats:

    • text: Human-readable text output for most commands
    • json: Structured data output, selected with --format json
    • dataurl: Base64-encoded data URLs for the image and screenshot commands, selected with --format dataurl
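
    A base64 data URL has the general shape data:<mime type>;base64,<payload>. As a rough sketch of how a script might split such a value into its parts, consider the helper below. The function and interface names are hypothetical, and the image/png MIME type in the example is an assumption; this page does not state which MIME type the CLI emits.

```typescript
// Hypothetical result type for a parsed base64 data URL.
interface DataUrlParts {
  mimeType: string;
  base64: string;
}

// Hypothetical helper: splits "data:<mime>;base64,<payload>" into its
// MIME type and base64 payload. Throws on any other input shape.
function parseDataUrl(dataUrl: string): DataUrlParts {
  const match = /^data:([^;,]+);base64,([A-Za-z0-9+/=]*)$/.exec(dataUrl);
  if (match === null) {
    throw new Error("Not a base64 data URL");
  }
  return { mimeType: match[1], base64: match[2] };
}

// Example value with an assumed MIME type and a short payload.
const parts = parseDataUrl("data:image/png;base64,iVBORw0KGgo=");
// parts.mimeType is "image/png", parts.base64 is "iVBORw0KGgo="
```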

    Specify page ranges using comma-separated values and ranges:

    • 1: Page 1
    • 1,3,5: Pages 1, 3, and 5
    • 1-5: Pages 1 through 5
    • 1,3-5,7: Pages 1, 3, 4, 5, and 7
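
    The range syntax above can be expanded into explicit page numbers with a small parser. The following is a minimal sketch of that expansion, not the CLI's own parser: parsePageRange is a hypothetical name, and rejecting reversed ranges such as 5-2 is an assumption about how invalid input should be treated.

```typescript
// Hypothetical helper, not part of pdf-parse: expands a page-range
// string such as "1,3-5,7" into a sorted list of unique page numbers.
function parsePageRange(spec: string): number[] {
  const pages: number[] = [];
  for (const part of spec.split(",")) {
    const token = part.trim();
    // Accept either a single page ("3") or an inclusive range ("3-5").
    const match = /^(\d+)(?:-(\d+))?$/.exec(token);
    if (match === null) {
      throw new Error(`Invalid page range: "${token}"`);
    }
    const start = Number(match[1]);
    const end = match[2] !== undefined ? Number(match[2]) : start;
    // Assumption: pages are 1-based and ranges must not be reversed.
    if (start < 1 || end < start) {
      throw new Error(`Invalid page range: "${token}"`);
    }
    for (let page = start; page <= end; page++) {
      if (pages.indexOf(page) === -1) {
        pages.push(page);
      }
    }
  }
  return pages.sort((a, b) => a - b);
}

const expanded = parsePageRange("1,3-5,7"); // [1, 3, 4, 5, 7]
```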

    The CLI tool provides clear error messages for common issues:

    • Invalid commands or options
    • Missing required arguments
    • File not found or inaccessible
    • Invalid page ranges
    • Network errors for URL-based operations