Format Converters¶

PyForge CLI supports conversion between multiple data formats. Each converter is optimized for its specific format with intelligent processing and error handling.

Available Converters¶

PDF to Text

Extract text from PDF documents with page range support

Learn More
Excel to Parquet

Convert Excel workbooks to high-performance Parquet format

Learn More
Database Files

Convert Access (MDB/ACCDB) and SQL Server (MDF) databases to Parquet

Learn More
DBF Files

Convert legacy DBF database files to Parquet

Learn More
:material-code-xml: XML to Parquet

Convert XML files to Parquet with intelligent flattening

Learn More
CSV to Parquet

Convert CSV/TSV files to Parquet with auto-detection

Learn More
MDF Tools Installer

Setup Docker & SQL Server Express for MDF file processing

Learn More

Format Compatibility Matrix¶

Input Format	File Extensions	Output Format	Status	Platform Support
PDF	`.pdf`	Text (`.txt`)	✅ Stable	Windows, macOS, Linux
Excel	`.xlsx`	Parquet (`.parquet`)	✅ Stable	Windows, macOS, Linux
XML	`.xml`, `.xml.gz`, `.xml.bz2`	Parquet (`.parquet`)	✅ Stable	Windows, macOS, Linux
Access	`.mdb`, `.accdb`	Parquet (`.parquet`)	✅ Stable	Windows, macOS, Linux
SQL Server	`.mdf`	Parquet (`.parquet`)	🚧 In Development	Windows, macOS, Linux
DBF	`.dbf`	Parquet (`.parquet`)	✅ Stable	Windows, macOS, Linux
CSV	`.csv`, `.tsv`, `.txt`	Parquet (`.parquet`)	✅ Stable	Windows, macOS, Linux

*Requires mdbtools installation **Requires MDF Tools (Docker + SQL Server Express)

Conversion Features¶

Universal Features¶

All converters support these common features:

Progress Tracking: Real-time progress bars and status updates
Error Handling: Graceful error recovery and detailed error messages
Metadata Preservation: Maintain important file metadata where possible
Batch Processing: Convert multiple files with consistent options
Verbose Output: Detailed logging for troubleshooting
Force Overwrite: Option to overwrite existing output files

Format-Specific Features¶

Each converter includes specialized features:

PDF Converter¶

Page range selection (--pages "1-10")
Metadata extraction (--metadata)
Text formatting preservation
Font and layout information

Excel Converter¶

Multi-sheet processing
Sheet selection (--sheets "Sheet1,Sheet2")
Column matching for combining sheets
Compression options (--compression gzip)
Interactive mode for sheet selection

Database Converters¶

Automatic table discovery
Cross-platform compatibility
Password-protected database support
Custom output directory structure
Table filtering options

DBF Converter¶

Automatic encoding detection
Support for various DBF formats
Field type preservation
Corrupted file recovery

Quick Start Examples¶

Basic Conversions¶

# Convert PDF to text
pyforge convert document.pdf

# Convert Excel to Parquet
pyforge convert spreadsheet.xlsx

# Convert Access database
pyforge convert database.mdb

# Convert DBF file
pyforge convert legacy.dbf

# Convert XML with intelligent flattening
pyforge convert api_response.xml

# Convert CSV with auto-detection
pyforge convert data.csv

Advanced Options¶

# PDF with page range and metadata
pyforge convert report.pdf --pages "1-20" --metadata

# Excel with specific sheets and compression
pyforge convert data.xlsx --sheets "Data,Summary" --compression gzip

# Database with custom output
pyforge convert database.mdb output_directory/

# DBF with specific encoding
pyforge convert legacy.dbf --encoding cp1252

# XML with aggressive flattening and array expansion
pyforge convert catalog.xml --flatten-strategy aggressive --array-handling expand

# CSV with compression
pyforge convert large_data.csv --compression gzip

Performance Considerations¶

File Size Guidelines¶

Format	Small	Medium	Large	Very Large
PDF	< 10 MB	10-100 MB	100 MB - 1 GB	> 1 GB
Excel	< 50 MB	50-200 MB	200 MB - 1 GB	> 1 GB
XML	< 10 MB	10-100 MB	100 MB - 1 GB	> 1 GB
Access	< 100 MB	100 MB - 1 GB	1-10 GB	> 10 GB
DBF	< 50 MB	50-500 MB	500 MB - 2 GB	> 2 GB
CSV	< 50 MB	50-500 MB	500 MB - 2 GB	> 2 GB

Optimization Tips¶

Memory Management

For large files, PyForge CLI automatically optimizes memory usage:

Streaming processing for large datasets
Chunked reading to prevent memory overflow
Progress reporting for long-running operations

Performance

To maximize performance:

Use SSD storage for input and output files
Ensure sufficient free disk space (2x file size recommended)
Close other applications when processing very large files
Consider using compression for output files

Error Handling¶

PyForge CLI provides comprehensive error handling:

Common Issues and Solutions¶

Error Type	Description	Solution
File Not Found	Input file doesn't exist	Check file path and permissions
Permission Denied	Cannot write output file	Check directory permissions
Corrupted File	Input file is damaged	Try with `--force` option or repair file
Encoding Issues	Character encoding problems	Specify encoding with `--encoding`
Memory Error	File too large for available memory	Close other applications or use streaming mode

Troubleshooting Commands¶

# Validate file before conversion
pyforge validate input_file.xlsx

# Get detailed file information
pyforge info input_file.pdf

# Run with verbose output
pyforge convert file.mdb --verbose

# Test with force option
pyforge convert file.dbf --force

Output Formats¶

Text Output (PDF Converter)¶

Format: Plain text (.txt)
Encoding: UTF-8
Features: Preserves line breaks, basic formatting

Parquet Output (All Other Converters)¶

Format: Apache Parquet (.parquet)
Compression: SNAPPY (default), GZIP, LZ4, ZSTD
Schema: Automatically inferred from source data
Features: Column-oriented, highly compressed, fast read/write

Next Steps¶

Choose a converter to learn more about:

PDF to Text - Document processing and text extraction
Excel to Parquet - Spreadsheet data conversion
XML to Parquet - XML flattening and structure analysis
Database Files - Access database migration
DBF Files - Legacy database modernization
CSV to Parquet - Delimited file processing

Or explore other sections:

CLI Reference - Complete command documentation
Tutorials - Real-world examples and workflows
API Documentation - Using PyForge as a Python library