PyForge CLI
A powerful command-line tool for data format conversion and synthetic data generation

What is PyForge CLI?

PyForge CLI is a command-line tool for data practitioners who need to convert between data formats such as PDF, Excel, XML, Access, DBF, CSV, and Parquet. Whether you're working with legacy databases, processing documents, or preparing data for analysis, PyForge CLI handles the conversion and reports progress through a rich terminal interface.

Quick Start

Get up and running in under 2 minutes:

# Using pip
pip install pyforge-cli

# Or using pipx
pipx install pyforge-cli

# Or using uv
uv add pyforge-cli

Your First Conversion

# Install sample datasets for testing
pyforge install sample-datasets

# Convert a PDF to text
pyforge convert document.pdf

# Convert Excel to Parquet
pyforge convert spreadsheet.xlsx

# Convert an Access database to Parquet
pyforge convert database.mdb

# Convert XML with intelligent flattening
pyforge convert api_response.xml

# Get help
pyforge --help
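
When many files need converting, the single-file commands above can be driven from a script. The following is a minimal sketch, not part of PyForge itself: the helper names `build_convert_command` and `convert_all` are hypothetical, and the only thing assumed about the CLI is the `pyforge convert <file>` form shown above.

```python
import subprocess
from pathlib import Path


def build_convert_command(input_path, extra_args=()):
    """Build the argument list for `pyforge convert <input> [extra args]`."""
    return ["pyforge", "convert", str(input_path), *extra_args]


def convert_all(directory, pattern="*.xlsx", dry_run=True):
    """Build (and optionally run) one convert command per matching file.

    With dry_run=True nothing is executed; the commands are only returned,
    which makes it easy to review them before converting anything.
    """
    commands = [
        build_convert_command(path)
        for path in sorted(Path(directory).glob(pattern))
    ]
    if not dry_run:
        for command in commands:
            # check=True raises CalledProcessError if pyforge exits non-zero.
            subprocess.run(command, check=True)
    return commands
```

Calling `convert_all("data", pattern="*.dbf", dry_run=False)` would run the converter once per DBF file in a (hypothetical) `data` directory and stop at the first failure.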

Supported Formats

| Input Format | Output Format | Status | Description |
| --- | --- | --- | --- |
| PDF (.pdf) | Text (.txt) | ✅ Available | Extract text with metadata and page ranges |
| Excel (.xlsx) | Parquet (.parquet) | ✅ Available | Multi-sheet support with intelligent merging |
| XML (.xml, .xml.gz, .xml.bz2) | Parquet (.parquet) | ✅ Available | Intelligent flattening with configurable strategies |
| Access (.mdb/.accdb) | Parquet (.parquet) | ✅ Available | Cross-platform database conversion |
| DBF (.dbf) | Parquet (.parquet) | ✅ Available | Legacy database with encoding detection |
| CSV (.csv) | Parquet (.parquet) | ✅ Available | Auto-detection of delimiters and encoding |
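
The table above can be expressed as a small lookup, which is handy for checking whether a file is convertible before invoking the CLI. This is an illustrative sketch only: the `SUPPORTED` mapping mirrors the table, while the function name `output_extension` is hypothetical and says nothing about how PyForge resolves converters internally.

```python
# Input suffix -> output suffix, copied from the Supported Formats table.
SUPPORTED = {
    ".pdf": ".txt",
    ".xlsx": ".parquet",
    ".xml": ".parquet",
    ".xml.gz": ".parquet",
    ".xml.bz2": ".parquet",
    ".mdb": ".parquet",
    ".accdb": ".parquet",
    ".dbf": ".parquet",
    ".csv": ".parquet",
}


def output_extension(filename):
    """Return the output suffix for `filename`, or None if unsupported."""
    name = filename.lower()
    # Check longer suffixes first so ".xml.gz" is matched as a whole
    # instead of being rejected as an unknown ".gz" file.
    for suffix in sorted(SUPPORTED, key=len, reverse=True):
        if name.endswith(suffix):
            return SUPPORTED[suffix]
    return None
```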

Key Features

🚀 Fast & Efficient

Built with performance in mind, PyForge CLI handles large files efficiently with progress tracking and memory optimization.

🎨 Beautiful Interface

Rich terminal output with progress bars, colored text, and structured tables makes the CLI a pleasure to use.

🔧 Intelligent Processing

  • Automatic encoding detection for legacy files
  • Smart table discovery and column matching
  • Metadata preservation across conversions
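
To give a feel for what automatic encoding detection involves, here is a minimal sketch that tries a short list of encodings in order. It is an assumption-laden illustration, not PyForge's implementation (the release notes mention a `chardet` dependency, which this sketch does not use). The candidate encodings are the ones named in the 0.2.3 release notes, and `guess_encoding` is a hypothetical name.

```python
import codecs

# Tried in order. latin-1 comes last because it accepts any byte sequence.
CANDIDATES = ("utf-8", "cp1252", "latin-1")


def guess_encoding(raw: bytes) -> str:
    """Guess a text encoding for `raw` by trial decoding."""
    # UTF-16 text written with a byte-order mark is recognised up front.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    for encoding in CANDIDATES:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return "latin-1"  # not reached in practice: latin-1 decodes every byte
```

Trial decoding can only say that an encoding is possible, not that it is the one the file was written in, which is why an explicit override such as the `--encoding` option shown for DBF files is still useful.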

🔌 Extensible Architecture

A plugin-based system makes it easy to add new format converters and custom processing logic.

📊 Data Practitioner Focused

Designed specifically for data engineers, scientists, and analysts with real-world use cases in mind.

Document Processing

Convert legal documents, reports, and contracts from PDF to searchable text for analysis.

pyforge convert contract.pdf --pages "1-10" --metadata

Legacy Database Migration

Modernize old Access and DBF databases by converting to Parquet format for cloud analytics.

pyforge convert legacy_system.mdb
pyforge convert customer_data.dbf --encoding cp1252

Excel Data Processing

Convert complex Excel workbooks to Parquet for efficient data processing and analysis.

pyforge convert financial_report.xlsx --combine --compression gzip

XML API Data Processing

Convert XML API responses and configuration files to Parquet for data analysis.

pyforge convert api_response.xml --flatten-strategy aggressive --array-handling expand
pyforge convert config.xml --namespace-handling strip

What's New

Version 1.0.9 (Latest) 🚀

  • 🎯 Databricks Serverless Support: Complete file conversion support in Databricks Serverless environments
  • Subprocess Backend: Specialized backend for MDB/ACCDB files due to Java SDK dependencies
  • Unity Catalog Integration: Native support for /Volumes/ paths with automatic file handling
  • Shell Command Support: CSV, XML, Excel, and DBF files can use standard %sh magic commands
  • Enhanced Error Handling: Proper exception chaining for better debugging
  • Smart Backend Selection: Automatic detection and selection based on environment
  • Performance Improvements: Memory-efficient processing and optimized conversions
  • Interactive Notebook: Ready-to-use Jupyter notebook for Databricks Serverless

Version 1.0.8

  • Complete Testing Infrastructure Overhaul: Fixed 13 major issues across infrastructure
  • Sample Datasets Installation: Fixed with intelligent fallback versioning system
  • Missing Dependencies: Added PyMuPDF, chardet, requests to resolve import errors
  • Convert Command Fix: Resolved TypeError in ConverterRegistry API
  • Comprehensive Testing Framework: Created systematic testing with 402 lines of test code
  • Notebook Organization: Restructured with proper unit/integration/functional hierarchy

Version 0.5.0

  • 🎉 Sample Datasets Collection: 23 curated test datasets across all supported formats
  • Automated Installation: pyforge install sample-datasets command with GitHub Releases integration
  • Format Filtering: Install specific formats with --formats pdf,excel,xml
  • Size Categories: Small (<100MB), Medium (100MB-1GB), Large (>1GB) datasets
  • Progress Tracking: Rich terminal UI with download progress and checksums
  • Dataset Management: List releases, show installed datasets, and uninstall options
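
The size categories above translate into a simple threshold check. In this sketch the decimal definitions of MB and GB and the treatment of the exact boundaries (100MB and 1GB both count as Medium) are assumptions, since the release notes do not specify them; `size_category` is a hypothetical helper.

```python
MB = 1_000_000   # assumption: decimal megabytes
GB = 1_000 * MB


def size_category(size_bytes: int) -> str:
    """Classify a dataset size as 'small', 'medium', or 'large'."""
    if size_bytes < 100 * MB:
        return "small"
    if size_bytes <= GB:
        return "medium"
    return "large"
```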

Version 0.4.0

  • 🚀 MDF Tools Installer: Complete SQL Server infrastructure for MDF file processing
  • Docker Integration: Automated Docker Desktop and SQL Server Express installation
  • Container Management: Full lifecycle commands for SQL Server container control
  • Cross-Platform Support: Windows, macOS, and Linux compatibility

Version 0.3.0

  • XML to Parquet Converter: Complete implementation with intelligent flattening
  • Automatic Structure Detection: Analyzes XML hierarchy and array patterns
  • Flexible Flattening Strategies: Conservative, moderate, and aggressive options
  • Advanced Array Handling: Expand, concatenate, or JSON string modes
  • Namespace Support: Configurable namespace processing
  • Schema Preview: Optional structure preview before conversion
  • Comprehensive Documentation: User guide and quick reference
  • Compressed XML Support: Handles .xml.gz and .xml.bz2 files
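
The three array-handling modes listed above can be illustrated with a small standard-library sketch. The mode names come from the release notes, but their exact semantics here are a guess, and the code is not PyForge's converter: it ignores attributes, namespaces, and mixed content, and the function names are hypothetical.

```python
import itertools
import json
import xml.etree.ElementTree as ET


def collect_leaves(element, prefix=""):
    """Return {dotted_path: [text, ...]} for every leaf element."""
    path = f"{prefix}.{element.tag}" if prefix else element.tag
    children = list(element)
    if not children:
        return {path: [(element.text or "").strip()]}
    leaves = {}
    for child in children:
        # Repeated tags share a path, so their values pile up in one list.
        for key, values in collect_leaves(child, path).items():
            leaves.setdefault(key, []).extend(values)
    return leaves


def flatten(xml_text, array_mode="expand"):
    """Flatten an XML document into a list of flat row dictionaries."""
    leaves = collect_leaves(ET.fromstring(xml_text))
    if array_mode == "concatenate":
        # One row; repeated values are joined into a single string.
        return [{key: ";".join(values) for key, values in leaves.items()}]
    if array_mode == "json":
        # One row; repeated values are stored as a JSON array string.
        return [{key: values[0] if len(values) == 1 else json.dumps(values)
                 for key, values in leaves.items()}]
    if array_mode == "expand":
        # One row per combination of repeated values.
        keys = list(leaves)
        combos = itertools.product(*(leaves[key] for key in keys))
        return [dict(zip(keys, combo)) for combo in combos]
    raise ValueError(f"unknown array_mode: {array_mode!r}")
```

For `<order><id>7</id><item>pen</item><item>ink</item></order>`, the `expand` mode yields two rows that share `order.id`, while `concatenate` yields a single row whose `order.item` value is `pen;ink`.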

Version 0.2.5

  • ✅ Fixed package build configuration and PyPI publication metadata
  • ✅ Resolved InvalidDistribution errors for wheel packaging
  • ✅ Updated hatchling build configuration for src layout

Version 0.2.4

  • ✅ Fixed GitHub Actions deprecation warnings and workflow failures
  • ✅ Updated pypa/gh-action-pypi-publish to latest version
  • ✅ Removed redundant sigstore signing steps

Version 0.2.3

  • 🎉 Major Feature: CSV to Parquet conversion with auto-detection
  • ✅ Intelligent delimiter detection (comma, semicolon, tab, pipe)
  • ✅ Smart encoding detection (UTF-8, Latin-1, Windows-1252, UTF-16)
  • ✅ Header detection with fallback to generic column names
  • ✅ String-based conversion consistent with Phase 1 architecture
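
Delimiter and header detection of the kind described above can be approximated with Python's standard `csv` module. This is a sketch under stated assumptions rather than the converter's actual logic: the candidate delimiters are the four listed in the release notes, the header heuristic and the `column_N` fallback names are invented for illustration, and all values are kept as strings.

```python
import csv


def detect_delimiter(sample, candidates=",;\t|"):
    """Pick the most plausible delimiter, falling back to a comma."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        return ","


def looks_like_header(row):
    """Heuristic: non-empty, unique cells, none of which parse as numbers."""
    def is_number(cell):
        try:
            float(cell)
        except ValueError:
            return False
        return True

    return (
        all(cell.strip() for cell in row)
        and len(set(row)) == len(row)
        and not any(is_number(cell) for cell in row)
    )


def read_columns(sample):
    """Return (column_names, data_rows) for a CSV text sample."""
    delimiter = detect_delimiter(sample)
    rows = list(csv.reader(sample.splitlines(), delimiter=delimiter))
    if rows and looks_like_header(rows[0]):
        return rows[0], rows[1:]
    # No usable header row: fall back to generic column names.
    width = len(rows[0]) if rows else 0
    return [f"column_{i}" for i in range(1, width + 1)], rows
```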

Version 0.2.2

  • ✅ Enhanced GitHub workflow templates for structured development
  • ✅ Updated README documentation with CSV support
  • ✅ Comprehensive testing and documentation for CSV converter

Version 0.2.1

  • ✅ Fixed GitHub Actions workflow for automated PyPI publishing
  • ✅ Updated CI/CD pipeline to use API token authentication

Version 0.2.0

  • ✅ Excel to Parquet conversion with multi-sheet support
  • ✅ MDB/ACCDB to Parquet conversion with cross-platform support
  • ✅ DBF to Parquet conversion with encoding detection
  • ✅ Interactive mode for Excel sheet selection
  • ✅ Progress tracking with rich terminal UI
