Format Converters¶
PyForge CLI supports conversion between multiple data formats. Each converter is optimized for its specific format with intelligent processing and error handling.
Available Converters¶
- 
PDF to Text
Extract text from PDF documents with page range support
 - 
Excel to Parquet
Convert Excel workbooks to high-performance Parquet format
 - 
Database Files
Convert Access (MDB/ACCDB) and SQL Server (MDF) databases to Parquet
 - 
DBF Files
Convert legacy DBF database files to Parquet
 - 
:material-code-xml: XML to Parquet
Convert XML files to Parquet with intelligent flattening
 - 
CSV to Parquet
Convert CSV/TSV files to Parquet with auto-detection
 - 
MDF Tools Installer
Setup Docker & SQL Server Express for MDF file processing
 
Format Compatibility Matrix¶
| Input Format | File Extensions | Output Format | Status | Platform Support | 
|---|---|---|---|---|
.pdf | 
Text (.txt) | 
✅ Stable | Windows, macOS, Linux | |
| Excel | .xlsx | 
Parquet (.parquet) | 
✅ Stable | Windows, macOS, Linux | 
| XML | .xml, .xml.gz, .xml.bz2 | 
Parquet (.parquet) | 
✅ Stable | Windows, macOS, Linux | 
| Access | .mdb, .accdb | 
Parquet (.parquet) | 
✅ Stable | Windows, macOS*, Linux* | 
| SQL Server | .mdf | 
Parquet (.parquet) | 
🚧 In Development | Windows, macOS, Linux | 
| DBF | .dbf | 
Parquet (.parquet) | 
✅ Stable | Windows, macOS, Linux | 
| CSV | .csv, .tsv, .txt | 
Parquet (.parquet) | 
✅ Stable | Windows, macOS, Linux | 
*Requires mdbtools installation **Requires MDF Tools (Docker + SQL Server Express)
Conversion Features¶
Universal Features¶
All converters support these common features:
- Progress Tracking: Real-time progress bars and status updates
 - Error Handling: Graceful error recovery and detailed error messages
 - Metadata Preservation: Maintain important file metadata where possible
 - Batch Processing: Convert multiple files with consistent options
 - Verbose Output: Detailed logging for troubleshooting
 - Force Overwrite: Option to overwrite existing output files
 
Format-Specific Features¶
Each converter includes specialized features:
PDF Converter¶
- Page range selection (
--pages "1-10") - Metadata extraction (
--metadata) - Text formatting preservation
 - Font and layout information
 
Excel Converter¶
- Multi-sheet processing
 - Sheet selection (
--sheets "Sheet1,Sheet2") - Column matching for combining sheets
 - Compression options (
--compression gzip) - Interactive mode for sheet selection
 
Database Converters¶
- Automatic table discovery
 - Cross-platform compatibility
 - Password-protected database support
 - Custom output directory structure
 - Table filtering options
 
DBF Converter¶
- Automatic encoding detection
 - Support for various DBF formats
 - Field type preservation
 - Corrupted file recovery
 
Quick Start Examples¶
Basic Conversions¶
# Convert PDF to text
pyforge convert document.pdf
# Convert Excel to Parquet
pyforge convert spreadsheet.xlsx
# Convert Access database
pyforge convert database.mdb
# Convert DBF file
pyforge convert legacy.dbf
# Convert XML with intelligent flattening
pyforge convert api_response.xml
# Convert CSV with auto-detection
pyforge convert data.csv
Advanced Options¶
# PDF with page range and metadata
pyforge convert report.pdf --pages "1-20" --metadata
# Excel with specific sheets and compression
pyforge convert data.xlsx --sheets "Data,Summary" --compression gzip
# Database with custom output
pyforge convert database.mdb output_directory/
# DBF with specific encoding
pyforge convert legacy.dbf --encoding cp1252
# XML with aggressive flattening and array expansion
pyforge convert catalog.xml --flatten-strategy aggressive --array-handling expand
# CSV with compression
pyforge convert large_data.csv --compression gzip
Performance Considerations¶
File Size Guidelines¶
| Format | Small | Medium | Large | Very Large | 
|---|---|---|---|---|
| < 10 MB | 10-100 MB | 100 MB - 1 GB | > 1 GB | |
| Excel | < 50 MB | 50-200 MB | 200 MB - 1 GB | > 1 GB | 
| XML | < 10 MB | 10-100 MB | 100 MB - 1 GB | > 1 GB | 
| Access | < 100 MB | 100 MB - 1 GB | 1-10 GB | > 10 GB | 
| DBF | < 50 MB | 50-500 MB | 500 MB - 2 GB | > 2 GB | 
| CSV | < 50 MB | 50-500 MB | 500 MB - 2 GB | > 2 GB | 
Optimization Tips¶
Memory Management
For large files, PyForge CLI automatically optimizes memory usage:
- Streaming processing for large datasets
 - Chunked reading to prevent memory overflow
 - Progress reporting for long-running operations
 
Performance
To maximize performance:
- Use SSD storage for input and output files
 - Ensure sufficient free disk space (2x file size recommended)
 - Close other applications when processing very large files
 - Consider using compression for output files
 
Error Handling¶
PyForge CLI provides comprehensive error handling:
Common Issues and Solutions¶
| Error Type | Description | Solution | 
|---|---|---|
| File Not Found | Input file doesn't exist | Check file path and permissions | 
| Permission Denied | Cannot write output file | Check directory permissions | 
| Corrupted File | Input file is damaged | Try with --force option or repair file | 
| Encoding Issues | Character encoding problems | Specify encoding with --encoding | 
| Memory Error | File too large for available memory | Close other applications or use streaming mode | 
Troubleshooting Commands¶
# Validate file before conversion
pyforge validate input_file.xlsx
# Get detailed file information
pyforge info input_file.pdf
# Run with verbose output
pyforge convert file.mdb --verbose
# Test with force option
pyforge convert file.dbf --force
Output Formats¶
Text Output (PDF Converter)¶
- Format: Plain text (.txt)
 - Encoding: UTF-8
 - Features: Preserves line breaks, basic formatting
 
Parquet Output (All Other Converters)¶
- Format: Apache Parquet (.parquet)
 - Compression: SNAPPY (default), GZIP, LZ4, ZSTD
 - Schema: Automatically inferred from source data
 - Features: Column-oriented, highly compressed, fast read/write
 
Next Steps¶
Choose a converter to learn more about:
- PDF to Text - Document processing and text extraction
 - Excel to Parquet - Spreadsheet data conversion
 - XML to Parquet - XML flattening and structure analysis
 - Database Files - Access database migration
 - DBF Files - Legacy database modernization
 - CSV to Parquet - Delimited file processing
 
Or explore other sections:
- CLI Reference - Complete command documentation
 - Tutorials - Real-world examples and workflows
 - API Documentation - Using PyForge as a Python library