Format Converters

PyForge CLI

PyForge CLI supports conversion between multiple data formats. Each converter is optimized for its specific format with intelligent processing and error handling.

Available Converters

  • PDF to Text: Extract text from PDF documents with page range support
  • Excel to Parquet: Convert Excel workbooks to high-performance Parquet format
  • Database Files: Convert Access (MDB/ACCDB) and SQL Server (MDF) databases to Parquet
  • DBF Files: Convert legacy DBF database files to Parquet
  • XML to Parquet: Convert XML files to Parquet with intelligent flattening
  • CSV to Parquet: Convert CSV/TSV files to Parquet with auto-detection
  • MDF Tools Installer: Set up Docker and SQL Server Express for MDF file processing

Format Compatibility Matrix

| Input Format | File Extensions          | Output Format      | Status            | Platform Support        |
|--------------|--------------------------|--------------------|-------------------|-------------------------|
| PDF          | .pdf                     | Text (.txt)        | ✅ Stable         | Windows, macOS, Linux   |
| Excel        | .xlsx                    | Parquet (.parquet) | ✅ Stable         | Windows, macOS, Linux   |
| XML          | .xml, .xml.gz, .xml.bz2  | Parquet (.parquet) | ✅ Stable         | Windows, macOS, Linux   |
| Access       | .mdb, .accdb             | Parquet (.parquet) | ✅ Stable         | Windows, macOS*, Linux* |
| SQL Server   | .mdf                     | Parquet (.parquet) | 🚧 In Development | Windows, macOS, Linux   |
| DBF          | .dbf                     | Parquet (.parquet) | ✅ Stable         | Windows, macOS, Linux   |
| CSV          | .csv, .tsv, .txt         | Parquet (.parquet) | ✅ Stable         | Windows, macOS, Linux   |

* Requires mdbtools installation.
** SQL Server (.mdf) conversion requires MDF Tools (Docker + SQL Server Express).
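
As a rough illustration of the matrix above, a small shell function can map a file name to the output type it should produce. The function name expected_output is hypothetical and not part of PyForge CLI; only the extension list comes from the table.

```shell
# Hypothetical helper (not part of PyForge CLI): print the output type
# expected for an input file, using the extensions in the matrix above.
expected_output() {
  local name
  # Lowercase the name so report.PDF and report.pdf are treated alike.
  name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$name" in
    *.pdf)
      echo "txt" ;;
    *.xlsx|*.xml|*.xml.gz|*.xml.bz2|*.mdb|*.accdb|*.mdf|*.dbf|*.csv|*.tsv|*.txt)
      echo "parquet" ;;
    *)
      # Extension not listed in the matrix.
      echo "unsupported"
      return 1 ;;
  esac
}
```

A script could call expected_output on each file before conversion and skip anything reported as unsupported.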

Conversion Features

Universal Features

All converters support these common features:

  • Progress Tracking: Real-time progress bars and status updates
  • Error Handling: Graceful error recovery and detailed error messages
  • Metadata Preservation: Maintain important file metadata where possible
  • Batch Processing: Convert multiple files with consistent options
  • Verbose Output: Detailed logging for troubleshooting
  • Force Overwrite: Option to overwrite existing output files
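
This page does not say how batch processing is invoked, so the sketch below simply loops in the shell and applies the same options to every file. The function name batch_convert and the DRY_RUN variable are assumptions made for illustration; only pyforge convert, --verbose, and --force appear on this page.

```shell
# Hypothetical wrapper (not part of PyForge CLI): run the same conversion
# options over several files. Set DRY_RUN=1 to print the commands only.
batch_convert() {
  local file
  for file in "$@"; do
    # Skip arguments that do not exist, such as a glob that matched nothing.
    [ -e "$file" ] || continue
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "pyforge convert $file --verbose --force"
    else
      # Keep going after a failure so one bad file does not stop the batch.
      pyforge convert "$file" --verbose --force || echo "failed: $file" >&2
    fi
  done
}
```

For example, DRY_RUN=1 batch_convert data/*.csv lists the commands that would run, which is a cheap way to check the file selection first.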

Format-Specific Features

Each converter includes specialized features:

PDF Converter

  • Page range selection (--pages "1-10")
  • Metadata extraction (--metadata)
  • Text formatting preservation
  • Font and layout information

Excel Converter

  • Multi-sheet processing
  • Sheet selection (--sheets "Sheet1,Sheet2")
  • Column matching for combining sheets
  • Compression options (--compression gzip)
  • Interactive mode for sheet selection

Database Converters

  • Automatic table discovery
  • Cross-platform compatibility
  • Password-protected database support
  • Custom output directory structure
  • Table filtering options

DBF Converter

  • Automatic encoding detection
  • Support for various DBF formats
  • Field type preservation
  • Corrupted file recovery

Quick Start Examples

Basic Conversions

# Convert PDF to text
pyforge convert document.pdf

# Convert Excel to Parquet
pyforge convert spreadsheet.xlsx

# Convert Access database
pyforge convert database.mdb

# Convert DBF file
pyforge convert legacy.dbf

# Convert XML with intelligent flattening
pyforge convert api_response.xml

# Convert CSV with auto-detection
pyforge convert data.csv

Advanced Options

# PDF with page range and metadata
pyforge convert report.pdf --pages "1-20" --metadata

# Excel with specific sheets and compression
pyforge convert data.xlsx --sheets "Data,Summary" --compression gzip

# Database with custom output directory
pyforge convert database.mdb output_directory/

# DBF with specific encoding
pyforge convert legacy.dbf --encoding cp1252

# XML with aggressive flattening and array expansion
pyforge convert catalog.xml --flatten-strategy aggressive --array-handling expand

# CSV with compression
pyforge convert large_data.csv --compression gzip

Performance Considerations

File Size Guidelines

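| Format | Small    | Medium        | Large         | Very Large |
|--------|----------|---------------|---------------|------------|
| PDF    | < 10 MB  | 10-100 MB     | 100 MB - 1 GB | > 1 GB     |
| Excel  | < 50 MB  | 50-200 MB     | 200 MB - 1 GB | > 1 GB     |
| XML    | < 10 MB  | 10-100 MB     | 100 MB - 1 GB | > 1 GB     |
| Access | < 100 MB | 100 MB - 1 GB | 1-10 GB       | > 10 GB    |
| DBF    | < 50 MB  | 50-500 MB     | 500 MB - 2 GB | > 2 GB     |
| CSV    | < 50 MB  | 50-500 MB     | 500 MB - 2 GB | > 2 GB     |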
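
The guidelines above can be turned into a quick lookup. This is only a sketch: size_tier is a hypothetical name, sizes are whole megabytes, 1 GB is taken as 1024 MB, and a size that falls exactly on a shared boundary (for example 100 MB for PDF) is assigned to the larger tier, since the table leaves those cases open.

```shell
# Hypothetical helper (not part of PyForge CLI).
# Usage: size_tier FORMAT SIZE_MB  ->  Small | Medium | Large | Very Large
size_tier() {
  local format mb small medium large
  format=$1
  mb=$2
  # Upper limits in MB, taken from the guideline table (1 GB assumed = 1024 MB).
  case "$format" in
    pdf|xml) small=10;  medium=100;  large=1024 ;;
    excel)   small=50;  medium=200;  large=1024 ;;
    access)  small=100; medium=1024; large=10240 ;;
    dbf|csv) small=50;  medium=500;  large=2048 ;;
    *) echo "unknown format: $format" >&2; return 1 ;;
  esac
  if   [ "$mb" -lt "$small" ];  then echo "Small"
  elif [ "$mb" -lt "$medium" ]; then echo "Medium"
  elif [ "$mb" -le "$large" ];  then echo "Large"
  else echo "Very Large"
  fi
}
```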

Optimization Tips

Memory Management

For large files, PyForge CLI automatically optimizes memory usage:

  • Streaming processing for large datasets
  • Chunked reading to prevent memory overflow
  • Progress reporting for long-running operations

Performance

To maximize performance:

  • Use SSD storage for input and output files
  • Ensure sufficient free disk space (2x file size recommended)
  • Close other applications when processing very large files
  • Consider using compression for output files
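
The free-space recommendation above can be checked before a long conversion starts. The sketch below is an assumption-laden illustration rather than a PyForge feature: has_room_for is a hypothetical name, and it relies on the POSIX output format of df -Pk, so a filesystem name containing spaces would confuse it.

```shell
# Hypothetical helper (not part of PyForge CLI): succeed when the output
# directory has at least twice the input file's size available.
has_room_for() {
  local file outdir size_kb free_kb
  file=$1
  outdir=${2:-.}
  # Input size in KB, rounded up.
  size_kb=$(( ($(wc -c < "$file") + 1023) / 1024 ))
  # Available KB on the filesystem that holds the output directory.
  free_kb=$(df -Pk "$outdir" | awk 'NR==2 {print $4}')
  [ "$free_kb" -ge $(( size_kb * 2 )) ]
}

# Example: only convert when there is enough room.
# has_room_for database.mdb output_directory/ && pyforge convert database.mdb output_directory/
```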

Error Handling

PyForge CLI reports failures with detailed error messages. The most common errors and their fixes are listed below.

Common Issues and Solutions

| Error Type        | Description                         | Solution                                       |
|-------------------|-------------------------------------|------------------------------------------------|
| File Not Found    | Input file doesn't exist            | Check file path and permissions                |
| Permission Denied | Cannot write output file            | Check directory permissions                    |
| Corrupted File    | Input file is damaged               | Try with --force option or repair file         |
| Encoding Issues   | Character encoding problems         | Specify encoding with --encoding               |
| Memory Error      | File too large for available memory | Close other applications or use streaming mode |
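
The first two rows of the table can be ruled out before PyForge is even called. The following is a minimal sketch using standard shell file tests; preflight is a hypothetical name, and the messages merely echo the error types above rather than PyForge's actual output.

```shell
# Hypothetical pre-check (not part of PyForge CLI): catch missing input
# files and unwritable output directories before starting a conversion.
preflight() {
  local input outdir
  input=$1
  outdir=${2:-.}
  if [ ! -f "$input" ]; then
    echo "File Not Found: $input"
    return 1
  fi
  if [ ! -r "$input" ]; then
    echo "Permission Denied: cannot read $input"
    return 1
  fi
  if [ ! -w "$outdir" ]; then
    echo "Permission Denied: cannot write to $outdir"
    return 1
  fi
  echo "ok"
}
```

A typical use would be preflight input_file.xlsx . && pyforge convert input_file.xlsx, so the conversion only starts once both checks pass.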

Troubleshooting Commands

# Validate file before conversion
pyforge validate input_file.xlsx

# Get detailed file information
pyforge info input_file.pdf

# Run with verbose output
pyforge convert file.mdb --verbose

# Retry with the --force option
pyforge convert file.dbf --force

Output Formats

Text Output (PDF Converter)

  • Format: Plain text (.txt)
  • Encoding: UTF-8
  • Features: Preserves line breaks, basic formatting

Parquet Output (All Other Converters)

  • Format: Apache Parquet (.parquet)
  • Compression: SNAPPY (default), GZIP, LZ4, ZSTD
  • Schema: Automatically inferred from source data
  • Features: Column-oriented, highly compressed, fast read/write

Next Steps

To learn more about a specific converter, see its entry under Available Converters above.