PDF processing skills - full functionality for merging, splitting, extracting, and OCR

PDF Processing Skill

Skill Overview

The PDF skill covers the common PDF processing tasks: reading, text and table extraction, merging, splitting, encryption, and OCR.

Applicable Scenarios

Document Automation

When you need to merge or split PDF files in bulk, this skill uses Python libraries and command-line tools to automate the work.

Data Extraction and Analysis

Extract text and structured data from PDF reports, invoices, and tables, with support for exporting tables to Excel for analysis.

Digitizing Scanned Documents

Perform OCR on scanned PDFs to convert image-based PDFs into searchable, editable text documents.

Core Features

Basic PDF Operations

Supports common operations such as merging multiple PDF files, splitting documents by pages, rotating pages, adding/removing watermarks, and setting password protection. These can be accomplished using pypdf or the command-line tool qpdf.

Content Extraction

Use pdfplumber to accurately extract text content and table data from PDFs, with support for preserving the original layout and exporting to structured data formats.

PDF Creation and Form Filling

Create new PDF documents with reportlab, including multi-page reports, and fill PDF form fields to automate form processing.

Frequently Asked Questions

What operations does the PDF skill support?

The skill supports reading and extracting PDF text/tables, merging/splitting PDFs, rotating pages, adding watermarks, creating new PDFs, filling forms, encrypting/decrypting, extracting images, and OCR text recognition for scanned PDFs.

How do I merge multiple PDF files?

You can use Python's pypdf library or the command-line tool qpdf. Python example: create a PdfWriter object, iterate through each PDF file's pages and add them to the writer, then save the merged file.

How do I extract text from scanned PDFs?

Scanned PDFs typically have no text layer, so you need OCR. The recommended approach is to convert each page to an image with pdf2image, then run text recognition with pytesseract. Install the Python packages with: pip install pytesseract pdf2image. You also need the system-level Tesseract engine, and pdf2image relies on the Poppler utilities being installed.
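A minimal sketch of that pipeline is shown below, assuming Tesseract and Poppler are installed on the system. The function names, the 300 dpi default, and the page label format are illustrative choices; convert_from_path and image_to_string are the library calls.

```python
def ocr_pdf(path: str, dpi: int = 300, lang: str = "eng") -> list[str]:
    """Return the recognized text of each page of a scanned PDF."""
    # Imported lazily so this module still loads where the OCR stack is missing.
    import pytesseract
    from pdf2image import convert_from_path

    images = convert_from_path(path, dpi=dpi)  # one PIL image per page
    return [pytesseract.image_to_string(image, lang=lang) for image in images]


def join_pages(page_texts: list[str]) -> str:
    """Label each page's text so the combined output stays traceable to its page."""
    return "\n\n".join(
        f"--- Page {number} ---\n{text.strip()}"
        for number, text in enumerate(page_texts, start=1)
    )
```

A higher dpi generally gives Tesseract more detail to work with at the cost of slower conversion and more memory per page.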

Which Python libraries should I use for PDF processing?

Different libraries are recommended depending on the task: use pypdf for basic operations (merge/split/encrypt); use pdfplumber for text and table extraction; use reportlab to create new PDFs; use qpdf or pdftotext for command-line batch processing.

How do I fill PDF forms?

See the FORMS.md document. The skill supports filling PDF form fields using pdf-lib (JavaScript) or pypdf (Python); refer to the form handling guide for implementation details.

Does it support command-line PDF processing?

Yes, it supports various command-line tools. Common ones include: qpdf (merge/split/rotate/decrypt), pdftotext (text extraction), pdftk (multi-purpose tool), and pdfimages (image extraction). These tools are suitable for batch processing in scripts.
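When these tools are driven from a Python script, one possible pattern is to build the argument lists separately and run them with subprocess. The helper names below are hypothetical, and the sketch assumes qpdf and pdftotext are on the PATH; the flags shown (qpdf --empty --pages ... --, pdftotext -layout) are the tools' own.

```python
import subprocess


def qpdf_merge_command(inputs: list[str], output: str) -> list[str]:
    """Build the qpdf argument list that concatenates `inputs` into `output`."""
    return ["qpdf", "--empty", "--pages", *inputs, "--", output]


def pdftotext_command(src: str, dst: str, keep_layout: bool = True) -> list[str]:
    """Build the pdftotext argument list, optionally preserving the page layout."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")
    return cmd + [src, dst]


def run(cmd: list[str]) -> None:
    """Run a command and raise CalledProcessError if it exits with a non-zero status."""
    subprocess.run(cmd, check=True)
```

Passing arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.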
