pdf

PDF manipulation toolkit: extract text and tables, create PDFs, merge and split files, and fill forms for programmatic document processing and analysis.

Install

Download and extract to your skills directory

Copy command and send to OpenClaw for auto-install:

Download and install this skill https://openskills.cc/api/download?slug=k-dense-ai-scientific-skills-pdf&locale=en&source=copy

PDF Processing Skills

Skill Overview


The PDF processing skill covers the common tasks in programmatic PDF handling: text and table extraction, PDF creation, merging and splitting, form filling, and more.

Applicable Scenarios

1. Data Extraction and Analysis


Extract structured data from PDFs such as commercial invoices, financial statements, and academic papers. Tables can be exported to Excel for further analysis, which suits batch processing of large numbers of documents.

2. Document Automation


Automatically merge multiple PDF files, split large files, add watermarks, and set password protection. Suitable for report generation, document archiving, bulk distribution, and other office automation scenarios.

3. Digitizing Scanned Documents


Use OCR technology to convert scanned PDFs into searchable, editable text, solving the problem of extracting text from digitized paper documents.

Core Features

Text and Table Extraction


Use the pdfplumber library to extract text content and table data from PDFs, with the option of preserving the original layout structure. Extracted tables can then be written to Excel (for example via pandas) for data analysis.

PDF Merging and Splitting


Use pypdf or command-line tools (qpdf, pdftk) to merge multiple PDF files or split a single PDF into multiple files by pages. Supports page rotation and metadata modification.
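Splitting by page can be sketched with pypdf roughly as follows. The helper names chunk_bounds and split_pdf, the output naming scheme, and the pages_per_file parameter are assumptions made for illustration; they are not part of the skill or of pypdf.

```python
from pathlib import Path


def chunk_bounds(page_count, pages_per_file):
    """Return (start, stop) zero-based page bounds for each output file."""
    if pages_per_file < 1:
        raise ValueError("pages_per_file must be at least 1")
    return [
        (start, min(start + pages_per_file, page_count))
        for start in range(0, page_count, pages_per_file)
    ]


def split_pdf(source, output_dir, pages_per_file=1):
    """Write consecutive chunks of `source` as separate PDFs; return their paths."""
    # Imported lazily so chunk_bounds() stays usable without pypdf installed.
    from pypdf import PdfReader, PdfWriter  # third-party: pip install pypdf

    reader = PdfReader(source)
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    bounds = chunk_bounds(len(reader.pages), pages_per_file)
    for number, (start, stop) in enumerate(bounds, start=1):
        writer = PdfWriter()
        for index in range(start, stop):
            writer.add_page(reader.pages[index])
        # Hypothetical naming scheme: <original stem>_part001.pdf, ...
        target = out_dir / f"{Path(source).stem}_part{number:03d}.pdf"
        with open(target, "wb") as handle:
            writer.write(handle)
        written.append(target)
    return written
```

Calling split_pdf("report.pdf", "parts", pages_per_file=10) would, under these assumptions, write one file per ten pages into the parts directory.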

PDF Creation and Generation


Create PDFs from scratch using the reportlab library, supporting multi-page reports, text layout, and graphics drawing. Suitable for automated report generation and document output scenarios.

Form Filling and Processing


Fill interactive PDF form fields programmatically, which makes it practical to populate standardized forms in batches.
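A minimal sketch of form filling with pypdf might look like the following. It assumes a recent pypdf release and a form with flat (non-nested) field names; split_known_fields and fill_form are hypothetical helper names, and the behavior of rejecting unknown field names is a design choice of this sketch.

```python
def split_known_fields(values, available_names):
    """Separate values that match a form field from those that do not."""
    known = {name: value for name, value in values.items() if name in available_names}
    unknown = sorted(set(values) - set(known))
    return known, unknown


def fill_form(source, target, values):
    """Fill the text fields of `source` with `values` and save to `target`."""
    # Imported lazily so split_known_fields() works without pypdf installed.
    from pypdf import PdfReader, PdfWriter  # third-party: pip install pypdf

    reader = PdfReader(source)
    fields = reader.get_fields() or {}  # get_fields() returns None without a form
    known, unknown = split_known_fields(values, fields.keys())
    if unknown:
        raise KeyError(f"fields not present in the form: {unknown}")

    writer = PdfWriter()
    writer.append(reader)  # copies the pages together with the form definition
    for page in writer.pages:
        if "/Annots" in page:  # only pages that carry form widgets
            writer.update_page_form_field_values(page, known)
    with open(target, "wb") as handle:
        writer.write(handle)
```

Failing early on unknown field names helps catch typos in batch jobs, where a silently skipped field is easy to miss.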

OCR for Scanned Documents


Combine pytesseract and pdf2image to perform optical character recognition on scanned PDFs, converting text in images to editable text.

Document Security and Protection


Provide PDF encryption and password protection features, supporting user and owner passwords to control permissions such as opening, printing, and copying the document.
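Password protection can be sketched with pypdf along these lines. The function name encrypt_pdf and the rule that at least one password must be non-empty are assumptions of this sketch. Permission control is left out here; pypdf's encrypt method is documented to accept a permissions argument, whose exact form should be checked against the installed version.

```python
def encrypt_pdf(source, target, user_password, owner_password=None):
    """Copy `source` to `target` with password protection applied."""
    if not user_password and not owner_password:
        raise ValueError("provide at least one non-empty password")
    # Imported lazily so the argument check runs without pypdf installed.
    from pypdf import PdfReader, PdfWriter  # third-party: pip install pypdf

    reader = PdfReader(source)
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)
    # The user password opens the document; the owner password governs
    # permissions. When no owner password is given, pypdf reuses the user one.
    writer.encrypt(user_password, owner_password)
    with open(target, "wb") as handle:
        writer.write(handle)
```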

Frequently Asked Questions

How to extract table data from a PDF?


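Use the pdfplumber library to detect and extract table content from each page of a PDF. Example code:
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        # Each table is a list of rows; each row is a list of cell strings (or None)
        tables = page.extract_tables()
        for table in tables:
            print(table)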

Extracted tables can be directly converted to DataFrames or exported to Excel, making it well-suited for processing structured documents like financial statements and invoices.
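The conversion to DataFrames and Excel could be sketched as follows. This assumes that the first row of every table is its header, which will not hold for all documents, and that pandas with an Excel engine such as openpyxl is installed. The names table_to_records and pdf_tables_to_excel are hypothetical.

```python
def table_to_records(table):
    """Convert one pdfplumber table (a list of rows) into a list of dicts.

    Assumes the first row is the header; empty cells (None) become "".
    Duplicate header names would overwrite each other in the dict.
    """
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    return [
        {name: (cell or "").strip() for name, cell in zip(header, row)}
        for row in table[1:]
    ]


def pdf_tables_to_excel(pdf_path, excel_path):
    """Write every table found in the PDF to its own sheet; return the count."""
    # Imported lazily so table_to_records() works without these packages.
    import pandas as pd  # third-party; .xlsx output also needs openpyxl
    import pdfplumber  # third-party: pip install pdfplumber

    frames = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                frames.append(pd.DataFrame(table_to_records(table)))
    if frames:  # an Excel workbook needs at least one sheet
        with pd.ExcelWriter(excel_path) as writer:
            for number, frame in enumerate(frames, start=1):
                frame.to_excel(writer, sheet_name=f"table_{number}", index=False)
    return len(frames)
```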

How to merge multiple PDF files?


It is recommended to use the pypdf library or the qpdf command-line tool. pypdf is suitable for integration into Python scripts, while qpdf is suitable for shell scripts and batch processing. Both can efficiently handle merging large numbers of files and support selecting page ranges and adjusting order.
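For the pypdf route, a merge with page selection and reordering might be sketched like this. The "1-3,5" page specification format and the helper names parse_page_spec and merge_pdfs are inventions of this example, not pypdf features.

```python
def parse_page_spec(spec, page_count):
    """Turn a 1-based spec such as "1-3,5" into zero-based page indices."""
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        first, _, last = part.partition("-")
        start = int(first)
        end = int(last) if last else start
        if not 1 <= start <= end <= page_count:
            raise ValueError(f"page range {part!r} is outside 1..{page_count}")
        indices.extend(range(start - 1, end))
    return indices


def merge_pdfs(sources, output_path):
    """Merge PDFs in the given order.

    `sources` is a list of (path, spec) pairs; a spec of None keeps every page.
    """
    # Imported lazily so parse_page_spec() works without pypdf installed.
    from pypdf import PdfReader, PdfWriter  # third-party: pip install pypdf

    writer = PdfWriter()
    for path, spec in sources:
        reader = PdfReader(path)
        total = len(reader.pages)
        wanted = range(total) if spec is None else parse_page_spec(spec, total)
        for index in wanted:
            writer.add_page(reader.pages[index])
    with open(output_path, "wb") as handle:
        writer.write(handle)
```

For example, merge_pdfs([("cover.pdf", None), ("body.pdf", "2-10")], "merged.pdf") would append pages 2 to 10 of body.pdf after the full cover file; the order of the list determines the order of the output.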

How to extract text from scanned PDFs?


Scanned PDFs contain page images rather than text, so OCR is required. First convert the PDF pages to images with pdf2image, then run pytesseract on each image for text recognition. For Chinese documents, install the matching Tesseract language data (for example chi_sim for Simplified Chinese). Recognition accuracy depends on scan quality and text clarity.
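The two-step OCR flow could be sketched as below. It assumes the Poppler utilities (used by pdf2image) and the Tesseract binary with the requested language data are installed on the system; tesseract_lang and ocr_pdf are hypothetical helper names, and the default of 300 DPI is a common starting point rather than a requirement.

```python
def tesseract_lang(codes):
    """Join Tesseract language codes with "+", the separator Tesseract expects."""
    cleaned = [code.strip() for code in codes if code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def ocr_pdf(pdf_path, languages=("eng",), dpi=300):
    """Return the recognized text of each page as a list of strings."""
    # Imported lazily so tesseract_lang() works without these packages.
    from pdf2image import convert_from_path  # third-party; needs Poppler
    import pytesseract  # third-party; needs the Tesseract binary

    lang = tesseract_lang(languages)
    page_texts = []
    # convert_from_path renders every page to a PIL image at the given DPI.
    for image in convert_from_path(pdf_path, dpi=dpi):
        page_texts.append(pytesseract.image_to_string(image, lang=lang))
    return page_texts
```

A mixed English and Simplified Chinese scan would be processed with ocr_pdf("scan.pdf", languages=("eng", "chi_sim")).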

How to create PDF documents with Python?


reportlab is a widely used Python PDF generation library, offering a complete API from low-level Canvas drawing to high-level document templates (Platypus). It is suitable for creating formatted documents such as reports, certificates, and invoices.
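As an illustration of the low-level Canvas route, a simple multi-page text report might be produced like this. The margins, font sizes, line spacing, and the helper names paginate and write_report are arbitrary choices of this sketch; Platypus would be the more natural choice for flowing paragraphs and tables.

```python
def paginate(lines, lines_per_page):
    """Split a list of lines into consecutive pages of at most lines_per_page."""
    if lines_per_page < 1:
        raise ValueError("lines_per_page must be at least 1")
    return [
        lines[start:start + lines_per_page]
        for start in range(0, len(lines), lines_per_page)
    ]


def write_report(path, title, lines, lines_per_page=40):
    """Draw `title` and the given text lines onto A4 pages."""
    # Imported lazily so paginate() works without reportlab installed.
    from reportlab.lib.pagesizes import A4  # third-party: pip install reportlab
    from reportlab.pdfgen import canvas

    _, height = A4  # page size in points (1/72 inch)
    pdf = canvas.Canvas(path, pagesize=A4)
    for page_lines in paginate(lines, lines_per_page):
        pdf.setFont("Helvetica-Bold", 14)
        pdf.drawString(72, height - 72, title)  # 72 pt = 1 inch margin
        pdf.setFont("Helvetica", 11)
        y = height - 108
        for line in page_lines:
            pdf.drawString(72, y, line)
            y -= 16  # move down one text line
        pdf.showPage()  # finish the current page
    pdf.save()
```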

How to add watermarks to PDFs for protection?


Use pypdf's merge_page functionality to overlay a watermark page onto each page. The watermark itself must be a PDF page, so a text or image watermark is first rendered into a one-page PDF (for example with reportlab). This feature is commonly used for copyright protection and distribution control.
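A minimal watermarking sketch with merge_page follows. It assumes the watermark already exists as a one-page PDF of the same page size as the source; add_watermark is a hypothetical function name, and the up-front file check is a convenience of this example.

```python
from pathlib import Path


def add_watermark(source, watermark, target):
    """Overlay the first page of `watermark` onto every page of `source`."""
    for path in (source, watermark):
        if not Path(path).is_file():
            raise FileNotFoundError(path)
    # Imported lazily so the file check runs without pypdf installed.
    from pypdf import PdfReader, PdfWriter  # third-party: pip install pypdf

    stamp = PdfReader(watermark).pages[0]
    reader = PdfReader(source)
    writer = PdfWriter()
    for page in reader.pages:
        # merge_page appends the stamp's content, so it is drawn on top.
        page.merge_page(stamp)
        writer.add_page(page)
    with open(target, "wb") as handle:
        writer.write(handle)
```

Because the stamp is drawn over the page content, a semi-transparent or light-colored watermark keeps the underlying text readable.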

Which command-line tools are recommended?


  • qpdf: Powerful, supports merging, splitting, rotating, and decrypting

  • pdftotext: Fast text extraction, supports layout preservation

  • pdftk: Classic tool with comprehensive features

  • pdfimages: Extract embedded images from PDFs

These tools are suitable for server-side batch processing and automation scripts.
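When these tools are driven from a Python automation script, the calls can be sketched as follows. The sketch assumes qpdf and pdftotext are installed and on the PATH; the wrapper names qpdf_merge_command, pdftotext_command, and run_tool are hypothetical.

```python
import subprocess


def qpdf_merge_command(inputs, output):
    """Build the qpdf call that concatenates `inputs` into `output`."""
    if not inputs:
        raise ValueError("at least one input file is required")
    # qpdf --empty --pages a.pdf b.pdf -- out.pdf
    return ["qpdf", "--empty", "--pages", *inputs, "--", output]


def pdftotext_command(source, target, keep_layout=True):
    """Build the pdftotext call; -layout keeps the physical text layout."""
    command = ["pdftotext"]
    if keep_layout:
        command.append("-layout")
    return command + [source, target]


def run_tool(command):
    """Run a command list and raise CalledProcessError on a non-zero exit."""
    subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.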