PDF to Text Converter¶
Convert PDF documents to plain text with advanced extraction options, page range selection, and metadata preservation.
Quick Start¶
# Basic conversion
pyforge convert document.pdf
# With page range
pyforge convert document.pdf --pages "1-10"
# With metadata
pyforge convert document.pdf --metadata
# Custom output file
pyforge convert document.pdf extracted_text.txt
Overview¶
The PDF to Text converter uses PyMuPDF (fitz) to extract text from PDF documents with high accuracy and performance. It supports:
- Text Extraction: High-quality text extraction preserving formatting
- Page Selection: Convert specific pages or page ranges
- Metadata Extraction: Include document metadata in output
- Layout Preservation: Maintain basic text layout and structure
- Error Recovery: Handle corrupted or complex PDF files
Command Syntax¶
Basic Examples¶
# Convert entire PDF
pyforge convert report.pdf
# Specify output file
pyforge convert report.pdf extracted_report.txt
# Convert with progress tracking
pyforge convert large_document.pdf --verbose
Page Selection¶
Control which pages to convert using the --pages option:
Page Range Syntax¶
Syntax | Description | Example |
---|---|---|
"1-10" | Pages 1 through 10 | --pages "1-10" |
"5-" | Page 5 to end of document | --pages "5-" |
"-10" | First 10 pages | --pages "-10" |
"1,3,5" | Specific pages only | --pages "1,3,5" |
"1-5,10-15" | Multiple ranges | --pages "1-5,10-15" |
Page Selection Examples¶
# First 5 pages
pyforge convert manual.pdf --pages "-5"
# Pages 10 to 20
pyforge convert manual.pdf --pages "10-20"
# From page 25 to end
pyforge convert manual.pdf --pages "25-"
# Specific pages
pyforge convert manual.pdf --pages "1,5,10,25"
# Multiple ranges
pyforge convert manual.pdf --pages "1-3,10-12,20-25"
# Complex selection
pyforge convert manual.pdf summary.txt --pages "1,3-7,15,20-"
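To make the range syntax concrete, the following Python sketch expands a --pages string into explicit page numbers. It illustrates the semantics described in the table and is not PyForge's own parser: the function name `parse_page_spec` is hypothetical, and clamping ranges to the document length is an assumption (the CLI may report an error instead).

```python
def parse_page_spec(spec: str, total_pages: int) -> list[int]:
    """Expand a --pages style string into sorted, 1-based page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            # An empty side means "from the first page" or "to the last page".
            start = int(start_text) if start_text.strip() else 1
            end = int(end_text) if end_text.strip() else total_pages
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        # Assumption: ranges past the end are clamped rather than rejected.
        pages.update(range(start, min(end, total_pages) + 1))
    return sorted(pages)


if __name__ == "__main__":
    print(parse_page_spec("1,3-7,15,20-", total_pages=22))
```

Overlapping ranges collapse because the pages are collected in a set before sorting.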
Metadata Options¶
Include document metadata in the output using --metadata:
# Include metadata
pyforge convert document.pdf --metadata
# Combine with page selection
pyforge convert document.pdf --pages "1-10" --metadata
Metadata Information¶
When --metadata is enabled, the output includes:
- Document title
- Author information
- Creation and modification dates
- Page count
- File size
- PDF version
- Security settings
Example Output with Metadata:
========================================
PDF METADATA
========================================
Title: Annual Report 2023
Author: Finance Department
Creator: Microsoft Word
Producer: Adobe PDF Library
Creation Date: 2023-12-01 14:30:25
Modification Date: 2023-12-15 09:45:12
Pages: 45
File Size: 2.4 MB
PDF Version: 1.7
Security: Not Encrypted
========================================
[Document text content follows...]
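If a downstream script needs the metadata as structured data, the header can be separated from the body text. The sketch below assumes the output follows the example layout exactly (a "PDF METADATA" title between two ruler lines of "=" characters, then "Key: Value" lines, then a closing ruler); the function name `split_metadata` is hypothetical and the real output may differ in detail.

```python
def _is_ruler(line: str) -> bool:
    """True for a line made only of '=' characters."""
    stripped = line.strip()
    return bool(stripped) and set(stripped) == {"="}


def split_metadata(text: str) -> tuple[dict[str, str], str]:
    """Return (metadata, body); metadata is empty when no header is found."""
    lines = text.splitlines()
    rulers = [i for i, line in enumerate(lines) if _is_ruler(line)]
    has_header = (
        len(rulers) >= 3
        and rulers[1] == rulers[0] + 2
        and lines[rulers[0] + 1].strip() == "PDF METADATA"
    )
    if not has_header:
        return {}, text
    metadata: dict[str, str] = {}
    for line in lines[rulers[1] + 1 : rulers[2]]:
        # Split on the first colon only, so timestamps keep their colons.
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip()] = value.strip()
    body = "\n".join(lines[rulers[2] + 1 :])
    return metadata, body
```

All values are returned as strings; converting "Pages" to an integer or parsing the dates is left to the caller.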
Advanced Options¶
Output Control¶
# Force overwrite existing files
pyforge convert document.pdf --force
# Specify custom output location
pyforge convert document.pdf /path/to/output.txt
# Verbose output for debugging
pyforge convert document.pdf --verbose
Error Handling¶
# Attempt to process corrupted PDFs
pyforge convert damaged.pdf --force
# Skip problematic pages
pyforge convert complex.pdf --skip-errors
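When these options are driven from a script, it can help to assemble the argument list in one place. The sketch below only builds the list and does not run anything; the helper name `build_convert_command` and its keyword arguments are hypothetical, and the flags are limited to those shown on this page.

```python
from typing import Optional


def build_convert_command(
    pdf_path: str,
    output_path: Optional[str] = None,
    pages: Optional[str] = None,
    metadata: bool = False,
    force: bool = False,
    verbose: bool = False,
    skip_errors: bool = False,
) -> list[str]:
    """Build an argument list for `pyforge convert` from keyword options."""
    cmd = ["pyforge", "convert", pdf_path]
    if output_path:
        cmd.append(output_path)
    if pages:
        cmd.extend(["--pages", pages])
    # Boolean flags are appended only when enabled.
    for enabled, flag in (
        (metadata, "--metadata"),
        (force, "--force"),
        (verbose, "--verbose"),
        (skip_errors, "--skip-errors"),
    ):
        if enabled:
            cmd.append(flag)
    return cmd


if __name__ == "__main__":
    print(build_convert_command("document.pdf", pages="1-10", metadata=True))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues with file names that contain spaces.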
Text Extraction Quality¶
What Works Well¶
- Standard Text: Regular paragraphs and headings
- Tables: Simple table structures (converted to aligned text)
- Lists: Bulleted and numbered lists
- Headers/Footers: Page headers and footers
- Multi-column: Basic multi-column layouts
Limitations¶
- Complex Layouts: Heavily formatted documents may lose structure
- Images: Text within images is not extracted (OCR not included)
- Forms: Interactive form fields may not be captured
- Annotations: Comments and annotations are not included
- Embedded Objects: Charts and diagrams are converted to placeholder text
Quality Tips¶
Best Results
For the best text extraction:
- Use PDFs created from text documents (not scanned images)
- Prefer PDFs with selectable text
- Avoid heavily graphical or artistic layouts
- Consider the source application (Word docs convert better than InDesign layouts)
File Information¶
Get detailed information about a PDF before conversion:
# Basic file info
pyforge info document.pdf
# Detailed information
pyforge info document.pdf --verbose
Example Output:
📄 File: annual_report.pdf
📊 Type: PDF Document
📏 Size: 2.4 MB
📋 Pages: 45
🔒 Encrypted: No
📝 Text Extractable: Yes
🎨 Has Images: Yes
📑 Has Forms: No
┌─────────────┬─────────────────────────┐
│ Property │ Value │
├─────────────┼─────────────────────────┤
│ Title │ Annual Report 2023 │
│ Author │ Finance Department │
│ Creator │ Microsoft Word │
│ Producer │ Adobe PDF Library │
│ Created │ 2023-12-01 14:30:25 │
│ Modified │ 2023-12-15 09:45:12 │
│ PDF Version │ 1.7 │
└─────────────┴─────────────────────────┘
Validation¶
Validate PDF files before conversion:
# Check if file can be processed
pyforge validate document.pdf
# Detailed validation
pyforge validate document.pdf --verbose
Performance¶
Processing Speed¶
Document Type | Pages | Typical Speed |
---|---|---|
Text-heavy | 1-50 | 10-50 pages/sec |
Mixed content | 1-50 | 5-20 pages/sec |
Image-heavy | 1-50 | 2-10 pages/sec |
Large files | 100+ | 5-15 pages/sec |
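The speed ranges in the table translate into a rough time estimate by dividing the page count by the fast and slow ends of the range. This sketch copies the figures from the table; the dictionary keys and the function name `estimate_seconds` are hypothetical, and actual timings depend on hardware and document content.

```python
# Pages-per-second ranges (slow, fast) copied from the table above.
SPEED_RANGES = {
    "text-heavy": (10, 50),
    "mixed": (5, 20),
    "image-heavy": (2, 10),
    "large": (5, 15),
}


def estimate_seconds(pages: int, doc_type: str) -> tuple[float, float]:
    """Return (best_case, worst_case) conversion time in seconds."""
    slow, fast = SPEED_RANGES[doc_type]
    return pages / fast, pages / slow


if __name__ == "__main__":
    best, worst = estimate_seconds(45, "text-heavy")
    print(f"45 text-heavy pages: roughly {best:.1f} to {worst:.1f} seconds")
```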
Memory Usage¶
- Small PDFs (< 10 MB): 50-100 MB RAM
- Medium PDFs (10-100 MB): 100-500 MB RAM
- Large PDFs (> 100 MB): 500 MB - 2 GB RAM
Optimization Tips¶
Large File Processing
For large PDF files:
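One option that relies only on the --pages syntax documented above is to convert a large file in page-range chunks, so that each run handles a bounded number of pages. The following is a minimal sketch under that assumption: the function names, the chunk size, and the output file naming scheme are all hypothetical, and the page count has to be obtained separately (for example from `pyforge info`).

```python
def chunk_page_ranges(total_pages: int, chunk_size: int) -> list[str]:
    """Split 1..total_pages into consecutive --pages range strings."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    ranges = []
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        # A one-page chunk is written as a single page number.
        ranges.append(str(start) if start == end else f"{start}-{end}")
    return ranges


def chunk_commands(pdf_path: str, total_pages: int, chunk_size: int) -> list[list[str]]:
    """Build one `pyforge convert` argument list per chunk (nothing is run)."""
    stem = pdf_path[:-4] if pdf_path.lower().endswith(".pdf") else pdf_path
    return [
        # Hypothetical naming scheme: <stem>_p<range>.txt
        ["pyforge", "convert", pdf_path, f"{stem}_p{page_range}.txt", "--pages", page_range]
        for page_range in chunk_page_ranges(total_pages, chunk_size)
    ]


if __name__ == "__main__":
    for cmd in chunk_commands("large_document.pdf", total_pages=120, chunk_size=50):
        print(" ".join(cmd))
```

Each command can be executed with `subprocess.run`, and the resulting text files concatenated in order afterwards.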
Common Use Cases¶
Legal Document Processing¶
# Extract contract text
pyforge convert contract.pdf --pages "1-10" --metadata
# Process multiple legal documents
mkdir -p processed  # create the output directory if it does not exist
for file in contracts/*.pdf; do
    pyforge convert "$file" "processed/$(basename "$file" .pdf).txt"
done
Research Paper Processing¶
# Extract paper content (skip references)
pyforge convert research_paper.pdf --pages "1-25"
# Include metadata for citation
pyforge convert research_paper.pdf --metadata
Report Processing¶
# Extract executive summary
pyforge convert annual_report.pdf summary.txt --pages "3-8"
# Full report with metadata
pyforge convert annual_report.pdf --metadata --verbose
Troubleshooting¶
Common Issues¶
Issue | Symptoms | Solution |
---|---|---|
Encrypted PDF | "Password required" error | Decrypt PDF first or provide password option |
Corrupted File | "Invalid PDF" error | Try --force option |
No Text Output | Empty or minimal text | PDF may be image-based (needs OCR) |
Garbled Text | Strange characters | Check PDF encoding/font issues |
Memory Error | Process crashes | Reduce page range or close other applications |
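A wrapper script can apply the table above automatically by inspecting the error text and the extracted output. This sketch assumes the quoted messages ("Password required", "Invalid PDF") appear somewhere in the CLI's error output, which is not confirmed here; the function name `suggest_fix` and the hint wording are hypothetical.

```python
from typing import Optional

# Substrings taken from the "Symptoms" column, matched case-insensitively.
MESSAGE_HINTS = [
    ("password required", "Encrypted PDF: decrypt the file before converting."),
    ("invalid pdf", "Corrupted file: retry with --force."),
]


def suggest_fix(stderr: str, extracted_text: str) -> Optional[str]:
    """Map an error message or empty output to a troubleshooting hint."""
    message = stderr.lower()
    for needle, hint in MESSAGE_HINTS:
        if needle in message:
            return hint
    if not extracted_text.strip():
        return "No text output: the PDF may be image-based and need OCR."
    return None


if __name__ == "__main__":
    print(suggest_fix("Error: Password required", ""))
```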
Troubleshooting Commands¶
# Check file validity
pyforge validate problematic.pdf
# Try force processing
pyforge convert problematic.pdf --force
# Get detailed file information
pyforge info problematic.pdf --verbose
# Process small page range first
pyforge convert problematic.pdf test.txt --pages "1-5"
Output Format¶
Text Structure¶
The extracted text maintains:
- Paragraph breaks: Preserved from original
- Line breaks: Maintained where appropriate
- Spacing: Basic spacing preserved
- Headers/Footers: Included in extraction
- Page breaks: Marked with page numbers (if --metadata used)
Example Output Structure¶
Page 1
======
ANNUAL REPORT 2023
Finance Department
Executive Summary
This report provides a comprehensive overview of our
financial performance for the fiscal year 2023...
Key Highlights:
• Revenue increased by 15%
• Profit margins improved
• Successful market expansion
Page 2
======
Financial Overview
The following table shows our quarterly performance:
Q1 $2.5M 15%
Q2 $2.8M 18%
Q3 $3.1M 20%
Q4 $3.4M 22%
...
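The page markers make it possible to split the extracted text back into pages. The sketch below assumes each page starts with a "Page N" line followed by a line of "=" characters, as in the example above; the function name `split_pages` is hypothetical, and body text that happens to match the same pattern would be misread as a page break.

```python
import re

# "Page N" on its own line, followed by a ruler line of "=" characters.
PAGE_HEADER = re.compile(r"^Page (\d+)\n=+[ \t]*\n?", re.MULTILINE)


def split_pages(text: str) -> dict[int, str]:
    """Return a mapping of page number to that page's text."""
    pages: dict[int, str] = {}
    matches = list(PAGE_HEADER.finditer(text))
    for index, match in enumerate(matches):
        # A page runs from the end of its header to the start of the next one.
        end = matches[index + 1].start() if index + 1 < len(matches) else len(text)
        pages[int(match.group(1))] = text[match.end():end].strip()
    return pages


if __name__ == "__main__":
    sample = "Page 1\n======\nANNUAL REPORT 2023\nPage 2\n======\nFinancial Overview\n"
    for number, body in split_pages(sample).items():
        print(number, repr(body))
```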
Integration Examples¶
Bash Scripting¶
#!/bin/bash
# Process all PDFs in a directory
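for pdf in *.pdf; do
    [ -e "$pdf" ] || continue  # skip the literal pattern when no PDFs match
    echo "Processing $pdf..."
    # Report success only when the conversion exits with status 0
    if pyforge convert "$pdf" "${pdf%.pdf}.txt" --metadata; then
        echo "✓ Completed $pdf"
    else
        echo "✗ Failed $pdf" >&2
    fi
done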
Python Integration¶
import subprocess

def extract_pdf_text(pdf_path, output_path=None, pages=None):
    """Extract text from a PDF using the PyForge CLI. Returns True on success."""
    cmd = ["pyforge", "convert", pdf_path]
    if output_path:
        cmd.append(output_path)
    if pages:
        cmd.extend(["--pages", pages])
    cmd.append("--metadata")  # always include document metadata
    # Capture stdout/stderr instead of printing them to the console
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0

# Usage
success = extract_pdf_text("report.pdf", "extracted.txt", "1-10")
Next Steps¶
- Excel Converter - Learn about Excel to Parquet conversion
- CLI Reference - Complete command documentation
- Tutorials - Real-world PDF processing workflows
- Troubleshooting - Solve common PDF issues