PDF manipulation toolkit. Extract text/tables, create PDFs, merge/split, fill forms, for programmatic document processing and analysis.
Category: Document Processing
Install
Download the skill and extract it to your skills directory.
Alternatively, copy the command below and send it to OpenClaw for auto-install:
Download and install this skill https://openskills.cc/api/download?slug=k-dense-ai-scientific-skills-pdf&locale=en&source=copy
PDF Processing Skills
Skill Overview
The PDF processing skill covers the common programmatic PDF tasks: text and table extraction, PDF creation, merging and splitting, form filling, OCR for scanned documents, and password protection.
Applicable Scenarios
1. Data Extraction and Analysis
Extract structured data from PDFs such as commercial invoices, financial statements, and academic papers. Extracted tables can be exported to Excel for further analysis, which makes the skill well suited to batch processing large numbers of documents.
2. Document Automation
Automatically merge multiple PDF files, split large files, add watermarks, and set password protection. Suitable for report generation, document archiving, bulk distribution, and other office automation scenarios.
3. Digitizing Scanned Documents
Use OCR technology to convert scanned PDFs into searchable, editable text, solving the problem of extracting text from digitized paper documents.
Core Features
Text and Table Extraction
Use the pdfplumber library to extract text and table data from PDFs, with the option of preserving the original layout. Extracted tables can be exported to Excel for data analysis.
PDF Merging and Splitting
Use pypdf or command-line tools (qpdf, pdftk) to merge multiple PDF files or split a single PDF into multiple files by pages. Supports page rotation and metadata modification.
PDF Creation and Generation
Create PDFs from scratch using the reportlab library, supporting multi-page reports, text layout, and graphics drawing. Suitable for automated report generation and document output scenarios.
Form Filling and Processing
Fill interactive PDF form fields programmatically, so that standardized forms can be populated in batches.
OCR for Scanned Documents
Combine pytesseract and pdf2image to perform optical character recognition on scanned PDFs, converting text in images to editable text.
Document Security and Protection
Provide PDF encryption and password protection features, supporting user and owner passwords to control permissions such as opening, printing, and copying the document.
Frequently Asked Questions
How to extract table data from a PDF?
Use the pdfplumber library to accurately identify and extract table content from PDFs. Example code:
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        # Each table is a list of rows; each row is a list of cell values (str or None).
        tables = page.extract_tables()
        print(f"Page {page.page_number}: {len(tables)} table(s)")
Extracted tables can be loaded into pandas DataFrames or exported to Excel, which suits structured documents such as financial statements and invoices.
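The DataFrame and Excel step mentioned above might look like the following sketch. It assumes the first row of each table is a header row with distinct column names, which is common but not guaranteed. The names `table_to_records` and `export_tables` are hypothetical, and pandas needs an Excel engine such as openpyxl installed.

```python
def table_to_records(table):
    """Turn one pdfplumber table (a list of rows) into a list of dicts.

    Assumes the first row holds the column headers. Empty cells, which
    pdfplumber reports as None, become empty strings.
    """
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    return [
        {name: (cell or "") for name, cell in zip(header, row)}
        for row in table[1:]
    ]


def export_tables(pdf_path, xlsx_path):
    """Write every table in the PDF to its own worksheet; return the count."""
    # Imported here so table_to_records stays usable without these packages.
    import pandas as pd
    import pdfplumber

    frames = {}
    with pdfplumber.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            for index, table in enumerate(page.extract_tables(), start=1):
                name = f"p{page_number}_t{index}"  # e.g. "p3_t1"
                frames[name] = pd.DataFrame(table_to_records(table))

    if frames:  # an Excel workbook needs at least one sheet
        with pd.ExcelWriter(xlsx_path) as writer:
            for name, frame in frames.items():
                frame.to_excel(writer, sheet_name=name, index=False)
    return len(frames)
```

Keeping the header handling in a separate function makes it easy to swap in a different rule for documents whose tables have no header row or repeat it on every page.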
How to merge multiple PDF files?
It is recommended to use the pypdf library or the qpdf command-line tool. pypdf is suitable for integration into Python scripts, while qpdf is suitable for shell scripts and batch processing. Both can efficiently handle merging large numbers of files and support selecting page ranges and adjusting order.
How to extract text from scanned PDFs?
You need to use OCR technology. First convert the PDF to images with pdf2image, then use pytesseract for text recognition. For Chinese documents, install the corresponding Chinese language pack. Recognition accuracy depends on scan quality and text clarity.
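The two-step flow described above could be sketched as follows. The function names, the default of 300 DPI, and the language codes are assumptions; `chi_sim` is Tesseract's code for Simplified Chinese, and the matching language data must be installed separately. pdf2image needs Poppler and pytesseract needs the Tesseract program on the system.

```python
def tesseract_lang(codes):
    """Join Tesseract language codes, e.g. ["chi_sim", "eng"] -> "chi_sim+eng"."""
    return "+".join(codes)


def ocr_pdf(pdf_path, codes=("eng",), dpi=300):
    """Return a list with the recognized text of each page of a scanned PDF."""
    # Imported here because both packages depend on external programs
    # (Poppler for pdf2image, Tesseract for pytesseract).
    from pdf2image import convert_from_path
    import pytesseract

    # Step 1: render every PDF page to a PIL image.
    images = convert_from_path(pdf_path, dpi=dpi)

    # Step 2: run text recognition on each image.
    lang = tesseract_lang(codes)
    return [pytesseract.image_to_string(image, lang=lang) for image in images]
```

Calling `ocr_pdf("scan.pdf", codes=("chi_sim", "eng"))` on a hypothetical mixed-language scan would ask Tesseract to use both language models. Raising `dpi` can help with small print at the cost of slower processing.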
How to create PDF documents with Python?
reportlab is one of the most widely used Python libraries for PDF generation, offering APIs that range from low-level Canvas drawing to high-level document templates (Platypus). It is suitable for creating formatted documents such as reports, certificates, and invoices.
How to add watermarks to PDFs for protection?
Use pypdf's merge_page functionality to overlay a watermark page onto each page. The watermark can be text, an image, or another PDF page. This feature is commonly used for copyright protection and distribution control.
Which command-line tools are recommended?
qpdf and pdftk, the command-line tools named under PDF Merging and Splitting, are suitable for server-side batch processing and automation scripts.
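When a batch job is driven from Python rather than a shell script, a command-line tool can be invoked through `subprocess`. The sketch below builds a qpdf merge command; the helper names are hypothetical, and it assumes the `qpdf` executable is installed and on the PATH.

```python
import subprocess


def qpdf_merge_command(inputs, output):
    """Build the qpdf argument list that concatenates inputs into output."""
    if not inputs:
        raise ValueError("at least one input PDF is required")
    # Equivalent shell command: qpdf --empty --pages in1.pdf in2.pdf -- out.pdf
    return ["qpdf", "--empty", "--pages", *inputs, "--", output]


def qpdf_merge(inputs, output):
    """Run qpdf; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(qpdf_merge_command(inputs, output), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces. Note that qpdf also exits with a non-zero status when it only emits warnings, so a stricter or looser check may be appropriate depending on the input files.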