CLI Command Reference¶
Complete reference for all PyForge CLI commands, options, and usage patterns.
Version 1.0.9 - Enhanced with Databricks Serverless support and Unity Catalog volume handling.
Main Commands¶
pyforge install¶
Install prerequisites and sample datasets for PyForge CLI.
Available Components¶
pyforge install sample-datasets¶
Install curated test datasets for all supported formats.
Examples:
# Install all datasets to default location
pyforge install sample-datasets
# Install to custom directory
pyforge install sample-datasets ./test-data
# Install to Unity Catalog volume
pyforge install sample-datasets /Volumes/catalog/schema/volume/datasets
# Install specific formats only
pyforge install sample-datasets --formats pdf,excel,xml
# Install small datasets only
pyforge install sample-datasets --sizes small
# List available releases
pyforge install sample-datasets --list-releases
# Show installed datasets
pyforge install sample-datasets --list-installed
# Uninstall datasets
pyforge install sample-datasets --uninstall --force
Options:
Option | Type | Description |
---|---|---|
--version <version> | string | Specific release version (e.g., v1.0.0) |
--formats <list> | string | Comma-separated format list (pdf,excel,xml,access,dbf,mdf,csv) |
--sizes <list> | string | Size categories (small,medium,large) |
--list-releases | flag | List all available dataset releases |
--list-installed | flag | Show currently installed datasets |
--force | flag | Force overwrite existing datasets |
--uninstall | flag | Remove installed datasets |
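These flags compose. For example, to stage only the small PDF and XML fixtures into a project-local folder (the ./fixtures path is illustrative):
# Install small PDF and XML sample datasets into ./fixtures, overwriting any previous copy
pyforge install sample-datasets ./fixtures --formats pdf,xml --sizes small --force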
pyforge install mdf-tools¶
Install Docker Desktop and SQL Server Express for MDF file processing. See Installation Commands and MDF Tools Management below for options and examples.
pyforge convert¶
Convert files between different formats with enhanced support for Databricks Serverless environments and Unity Catalog volumes.
Examples¶
# Basic conversion
pyforge convert document.pdf
# Using sample datasets
pyforge convert sample-datasets/pdf/small/NIST-CSWP-04162018.pdf
pyforge convert sample-datasets/excel/small/financial-sample.xlsx
# With custom output
pyforge convert document.pdf extracted_text.txt
# PDF with page range
pyforge convert report.pdf --pages "1-10"
# Excel with specific sheets
pyforge convert data.xlsx --sheets "Sheet1,Summary"
# XML with intelligent flattening
pyforge convert api_response.xml --flatten-strategy aggressive
# Database conversion
pyforge convert database.mdb output_directory/
# Database with specific backend
pyforge convert database.mdb --backend subprocess --verbose
# Unity Catalog volume files
pyforge convert /Volumes/catalog/schema/volume/data.mdb --volume-path
# CSV with auto-detection
pyforge convert data.csv --compression gzip
Options¶
Option | Type | Description | Applies To |
---|---|---|---|
--pages <range> | string | Page range to convert (e.g., "1-10") | PDF |
--metadata | flag | Include file metadata in output | PDF |
--sheets <names> | string | Comma-separated sheet names | Excel |
--combine | flag | Combine sheets into single output | Excel |
--separate | flag | Keep sheets as separate files | Excel |
--interactive | flag | Interactive sheet selection | Excel |
--compression <type> | string | Compression type (gzip, snappy, lz4) | Parquet outputs |
--encoding <encoding> | string | Character encoding (e.g., cp1252) | DBF |
--tables <names> | string | Comma-separated table names | MDB/ACCDB |
--password <password> | string | Database password | MDB/ACCDB |
--flatten-strategy <strategy> | string | XML flattening: conservative, moderate, aggressive | XML |
--array-handling <mode> | string | XML array handling: expand, concatenate, json_string | XML |
--namespace-handling <mode> | string | XML namespace handling: preserve, strip, prefix | XML |
--preview-schema | flag | Preview XML structure before conversion | XML |
--backend <backend> | string | Backend to use: subprocess, shell | MDB/ACCDB |
--volume-path | flag | Enable Unity Catalog volume path handling | All |
--force | flag | Overwrite existing output files | All |
--verbose | flag | Enable detailed output | All |
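Several of these options combine in a single invocation. A sketch using only the flags documented above (file and directory names are illustrative):
# Convert two tables from an Access database into snappy-compressed Parquet
pyforge convert inventory.accdb ./out/ --tables "items,vendors" --compression snappy --force --verbose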
pyforge info¶
Display detailed information about a file.
Examples¶
# Basic file information
pyforge info document.pdf
# Detailed information
pyforge info spreadsheet.xlsx --verbose
# JSON output format
pyforge info database.mdb --format json
Options¶
Option | Type | Description |
---|---|---|
--format <type> | string | Output format: table, json, yaml |
--verbose | flag | Show detailed information |
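The json format suits scripting. For example, piping through jq (assuming jq is installed) pretty-prints the metadata:
pyforge info database.mdb --format json | jq .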
pyforge validate¶
Validate if a file can be processed by PyForge CLI.
Examples¶
# Validate PDF file
pyforge validate document.pdf
# Validate with detailed output
pyforge validate spreadsheet.xlsx --verbose
# Batch validate files
for file in *.xlsx; do pyforge validate "$file"; done
Options¶
Option | Type | Description |
---|---|---|
--verbose | flag | Show detailed validation information |
pyforge formats¶
List all supported input and output formats.
Examples¶
# List all formats
pyforge formats
# Show format details
pyforge formats --verbose
# Filter by input format
pyforge formats --input pdf
Options¶
Option | Type | Description |
---|---|---|
--input <format> | string | Filter by input format |
--output <format> | string | Filter by output format |
--verbose | flag | Show detailed format information |
Global Options¶
These options work with all commands:
Option | Description | Example |
---|---|---|
--help, -h | Show help message | pyforge --help |
--version | Show version information | pyforge --version |
--verbose, -v | Enable verbose output | pyforge convert file.pdf --verbose |
Environment Variables¶
PyForge CLI recognizes these environment variables:
Variable | Description | Default |
---|---|---|
PYFORGE_OUTPUT_DIR | Default output directory | Current directory |
PYFORGE_TEMP_DIR | Temporary file directory | System temp |
PYFORGE_MAX_MEMORY | Maximum memory usage (MB) | Auto-detect |
PYFORGE_COMPRESSION | Default compression for Parquet | snappy |
IS_SERVERLESS | Databricks Serverless environment flag | false |
SPARK_CONNECT_MODE_ENABLED | Databricks Spark Connect mode | false |
DATABRICKS_RUNTIME_VERSION | Databricks runtime version | none |
Exit Codes¶
Code | Meaning | Description |
---|---|---|
0 | Success | Operation completed successfully |
1 | General Error | Unknown or general error |
2 | File Not Found | Input file does not exist |
3 | Permission Error | Cannot read input or write output |
4 | Format Error | Unsupported or corrupted file format |
5 | Validation Error | File failed validation |
6 | Memory Error | Insufficient memory for operation |
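Scripts can branch on these codes directly. A minimal bash sketch:
pyforge convert report.pdf
rc=$?
case $rc in
  0) echo "converted successfully" ;;
  2) echo "input file not found" ;;
  4) echo "unsupported or corrupted format" ;;
  *) echo "failed with exit code $rc" ;;
esac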
Configuration File¶
PyForge CLI can use a configuration file for default settings:
Location: ~/.pyforge/config.yaml
# Default settings
defaults:
  compression: gzip
  output_dir: ~/conversions
  verbose: false

# PDF-specific settings
pdf:
  include_metadata: true

# Excel-specific settings
excel:
  combine_sheets: false
  compression: snappy

# Database settings
database:
  encoding: utf-8

# XML-specific settings
xml:
  flatten_strategy: conservative
  array_handling: expand
  namespace_handling: preserve
  preview_schema: false
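One way to bootstrap the file from the shell, assuming the location above (values are illustrative):
mkdir -p ~/.pyforge
cat > ~/.pyforge/config.yaml <<'EOF'
defaults:
  compression: gzip
  output_dir: ~/conversions
EOF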
Advanced Usage Patterns¶
Databricks Serverless Usage¶
# Unity Catalog volume file conversion
pyforge convert /Volumes/catalog/schema/volume/data.mdb --volume-path --verbose
# Large database with subprocess backend
pyforge convert /Volumes/catalog/schema/volume/large.accdb \
--backend subprocess --volume-path --tables "customers,orders"
# Batch processing in Databricks notebooks (shell cell)
%sh
for file in /Volumes/catalog/schema/volume/*.mdb; do
  pyforge convert "$file" --backend subprocess --volume-path --force
done
# Using dbutils for file operations (Python cell; dbutils is not available under %sh)
dbutils.fs.ls("/Volumes/catalog/schema/volume/")
# Convert in a shell cell, then list the outputs from Python again
%sh pyforge convert /Volumes/catalog/schema/volume/data.mdb --volume-path
dbutils.fs.ls("/Volumes/catalog/schema/volume/output/")
# Install and use in Databricks Serverless (Python cell)
%pip install /Volumes/catalog/schema/pkgs/pyforge_cli-1.0.9-py3-none-any.whl \
    --no-cache-dir --quiet --index-url https://pypi.org/simple/ --trusted-host pypi.org
dbutils.library.restartPython()
# Validate installation (shell cells)
%sh pyforge --version
%sh pyforge convert /Volumes/catalog/schema/volume/test.mdb --backend subprocess --verbose
Batch Processing¶
# Process all PDFs in directory
find . -name "*.pdf" -exec pyforge convert {} \;
# Convert with consistent naming
for file in *.xlsx; do
  pyforge convert "$file" "${file%.xlsx}.parquet"
done
# Parallel processing
ls *.pdf | xargs -P 4 -I {} pyforge convert {}
# Batch convert XML files with consistent strategy
for file in *.xml; do
  pyforge convert "$file" "${file%.xml}.parquet" --flatten-strategy moderate
done
# Process XML files with different strategies based on size
find . -name "*.xml" -size +10M -exec pyforge convert {} --flatten-strategy conservative \;
find . -name "*.xml" -size -10M -exec pyforge convert {} --flatten-strategy aggressive \;
Pipeline Integration¶
# Use in shell pipeline
pyforge info *.xlsx | grep "Sheets:" | wc -l
# With other tools
find /data -name "*.mdb" | while read file; do
  pyforge convert "$file" && echo "Converted: $file"
done
# Databricks Unity Catalog volume processing
find /Volumes/catalog/schema/volume -name "*.mdb" | while read file; do
  pyforge convert "$file" --volume-path --backend subprocess
done
# Databricks notebook integration
# Check available MDB files (Python cell)
dbutils.fs.ls("/Volumes/catalog/schema/volume/")
# Convert all MDB files with the proper backend (shell cell)
%sh
for file in /Volumes/catalog/schema/volume/*.mdb; do
  pyforge convert "$file" --backend subprocess --volume-path --verbose
done
# Verify outputs (Python cell)
dbutils.fs.ls("/Volumes/catalog/schema/volume/output/")
Error Handling¶
# Check exit code
if pyforge convert file.pdf; then
  echo "Conversion successful"
else
  echo "Conversion failed with code $?"
fi
# Conditional processing
pyforge validate file.xlsx && pyforge convert file.xlsx
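The same pattern scales to batches: validate first, then convert, recording failures. A minimal sketch (failures.log is an illustrative log file):
for f in *.xlsx; do
  if pyforge validate "$f"; then
    pyforge convert "$f" || echo "$f" >> failures.log
  else
    echo "$f (failed validation)" >> failures.log
  fi
done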
Format-Specific Examples¶
PDF Processing¶
# Extract specific pages
pyforge convert manual.pdf chapter1.txt --pages "1-25"
# Include metadata and page markers
pyforge convert report.pdf --metadata --pages "1-10"
# Process multiple page ranges
pyforge convert book.pdf intro.txt --pages "1-5"
pyforge convert book.pdf content.txt --pages "6-200"
pyforge convert book.pdf appendix.txt --pages "201-"
Excel Processing¶
# Interactive sheet selection
pyforge convert workbook.xlsx --interactive
# Specific sheets with compression
pyforge convert data.xlsx --sheets "Data,Summary" --compression gzip
# Combine all sheets
pyforge convert financial.xlsx combined.parquet --combine
# Separate files for each sheet
pyforge convert report.xlsx --separate
Database Processing¶
# Convert with password
pyforge convert secure.mdb --password "secret123"
# Specific tables only
pyforge convert database.mdb --tables "customers,orders,products"
# Custom output directory
pyforge convert large.accdb /output/database/
# Databricks Serverless with subprocess backend
pyforge convert database.mdb --backend subprocess
# Unity Catalog volume path
pyforge convert /Volumes/catalog/schema/volume/data.mdb --volume-path
# Combined Databricks options
pyforge convert /Volumes/catalog/schema/volume/secure.mdb \
--backend subprocess --volume-path --password "secret123"
XML Processing¶
# Conservative flattening (default)
pyforge convert api_response.xml --flatten-strategy conservative
# Aggressive flattening for analytics
pyforge convert catalog.xml --flatten-strategy aggressive
# Handle arrays as concatenated strings
pyforge convert orders.xml --array-handling concatenate
# Strip namespaces for cleaner columns
pyforge convert soap_response.xml --namespace-handling strip
# Preview structure before conversion
pyforge convert complex.xml --preview-schema
# Convert compressed XML files
pyforge convert data.xml.gz --verbose
# Combined options for data analysis
pyforge convert api_data.xml analysis.parquet \
--flatten-strategy aggressive \
--array-handling expand \
--namespace-handling strip \
--compression gzip
DBF Processing¶
# With specific encoding
pyforge convert legacy.dbf --encoding cp1252
# Retry a conversion, overwriting existing output
pyforge convert damaged.dbf --force
# Verbose output for debugging
pyforge convert complex.dbf --verbose
Troubleshooting Commands¶
Debug Information¶
# System information
pyforge --version
python --version
pip show pyforge-cli
# File analysis
pyforge info problematic_file.pdf --verbose
pyforge validate problematic_file.pdf --verbose
# Test with minimal options
pyforge convert test_file.pdf --verbose
# Databricks environment debugging
echo "IS_SERVERLESS: $IS_SERVERLESS"
echo "SPARK_CONNECT_MODE_ENABLED: $SPARK_CONNECT_MODE_ENABLED"
echo "DATABRICKS_RUNTIME_VERSION: $DATABRICKS_RUNTIME_VERSION"
# Test MDB backend availability
pyforge convert test.mdb --backend subprocess --verbose
Common Issues¶
# Permission problems
sudo chown $USER output_directory/
chmod 755 output_directory/
# Memory issues
PYFORGE_MAX_MEMORY=1024 pyforge convert large_file.xlsx
# Encoding problems
pyforge convert file.dbf --encoding utf-8 --verbose
# Databricks volume path issues
pyforge convert /Volumes/catalog/schema/volume/file.mdb --volume-path --verbose
# MDB backend selection
pyforge convert database.mdb --backend subprocess --verbose
# Unity Catalog volume permissions (dbutils runs in a notebook Python cell)
dbutils.fs.ls("/Volumes/catalog/schema/volume/")
Performance Monitoring¶
Timing Commands¶
# Time conversion
time pyforge convert large_file.xlsx
# Monitor memory usage
/usr/bin/time -v pyforge convert file.mdb
# Progress tracking
pyforge convert large_file.pdf --verbose
Optimization¶
# Use compression for large outputs
pyforge convert file.xlsx --compression gzip
# Process in chunks
pyforge convert large.pdf chunk1.txt --pages "1-100"
pyforge convert large.pdf chunk2.txt --pages "101-200"
# Parallel processing
ls *.dbf | xargs -P $(nproc) -I {} pyforge convert {}
Integration Examples¶
Makefile Integration¶
%.txt: %.pdf
	pyforge convert $< $@

%.parquet: %.xlsx
	pyforge convert $< $@ --combine

all-pdfs: $(patsubst %.pdf,%.txt,$(wildcard *.pdf))
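With those rules saved in a Makefile alongside the source files, conversions become ordinary make targets:
make all-pdfs        # convert every PDF in the directory to .txt
make report.parquet  # built from report.xlsx via the %.parquet rule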
Python Subprocess¶
import subprocess
import json

def convert_file(input_path, output_path=None, **options):
    """Run `pyforge convert`, mapping keyword arguments to CLI flags."""
    cmd = ["pyforge", "convert", input_path]
    if output_path:
        cmd.append(output_path)
    for key, value in options.items():
        if value is False or value is None:
            continue  # skip disabled flags
        cmd.append(f"--{key.replace('_', '-')}")
        if value is not True:
            cmd.append(str(value))  # flag with an argument
    return subprocess.run(cmd, capture_output=True, text=True)

def get_file_info(file_path):
    """Return `pyforge info` output parsed as JSON, or None on failure."""
    result = subprocess.run(
        ["pyforge", "info", file_path, "--format", "json"],
        capture_output=True, text=True,
    )
    return json.loads(result.stdout) if result.returncode == 0 else None
Installation Commands¶
pyforge install¶
Install prerequisites for specific file format converters.
Available Tools¶
pyforge install mdf-tools¶
Install Docker Desktop and SQL Server Express for MDF file processing.
Options:
- --password <password>: Custom SQL Server password (default: PyForge@2024!)
- --port <port>: Custom SQL Server port (default: 1433)
- --non-interactive: Run in non-interactive mode for automation
Examples:
# Default installation
pyforge install mdf-tools
# Custom password and port
pyforge install mdf-tools --password "MySecure123!" --port 1433
# Non-interactive mode (for scripts)
pyforge install mdf-tools --non-interactive
MDF Tools Management¶
pyforge mdf-tools¶
Manage SQL Server Express container for MDF file processing.
pyforge mdf-tools status¶
Check Docker and SQL Server status.
Sample Output:
MDF Tools Status
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Component ┃ Status ┃ Details ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Docker Installed │ ✓ OK │ Docker command available │
│ Docker Running │ ✓ OK │ Docker daemon responsive │
│ SQL Container Exists │ ✓ OK │ Container created │
│ SQL Container Running │ ✓ OK │ Container active │
│ SQL Server Responding │ ✓ OK │ Database accessible │
│ Configuration File │ ✓ OK │ Settings saved │
└───────────────────────┴────────┴──────────────────────────┘
✅ All systems operational - ready for MDF processing!
pyforge mdf-tools start¶
Start the SQL Server Express container.
pyforge mdf-tools stop¶
Stop the SQL Server Express container.
pyforge mdf-tools restart¶
Restart the SQL Server Express container.
pyforge mdf-tools logs¶
View SQL Server container logs.
Options:
- --lines N, -n N: Number of log lines to show (default: 50)
Examples:
# Show last 50 lines (default)
pyforge mdf-tools logs
# Show last 100 lines
pyforge mdf-tools logs --lines 100
# Show last 10 lines
pyforge mdf-tools logs -n 10
pyforge mdf-tools config¶
Display current MDF tools configuration.
pyforge mdf-tools test¶
Test SQL Server connectivity and responsiveness.
pyforge mdf-tools uninstall¶
Remove SQL Server container and clean up all data.
Warning: This command permanently removes the SQL Server container, all data volumes, and configuration files.
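In scripts, it is worth gating this behind an explicit confirmation. A minimal sketch:
read -r -p "Remove SQL Server container and all data? [y/N] " ans
[ "$ans" = "y" ] && pyforge mdf-tools uninstall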
MDF Tools Usage Examples¶
Complete MDF Processing Workflow¶
# Step 1: Install MDF processing tools (one-time setup)
pyforge install mdf-tools
# Step 2: Verify installation
pyforge mdf-tools status
# Step 3: Test connectivity
pyforge mdf-tools test
# Step 4: Convert MDF files (when converter is available)
# pyforge convert database.mdf --format parquet
# Container lifecycle management
pyforge mdf-tools start # Start SQL Server
pyforge mdf-tools stop # Stop SQL Server
pyforge mdf-tools restart # Restart SQL Server
pyforge mdf-tools logs # View logs
pyforge mdf-tools config # Show configuration
pyforge mdf-tools uninstall # Complete removal
Automation and Scripting¶
# Non-interactive installation for CI/CD
pyforge install mdf-tools --non-interactive
# Check if ready for processing
if pyforge mdf-tools status | grep -q "All systems operational"; then
  echo "Ready for MDF processing"
else
  echo "MDF tools not ready"
  exit 1
fi
# Automated container management
pyforge mdf-tools start && \
pyforge mdf-tools test && \
echo "SQL Server is ready for MDF processing"
Databricks Serverless Specific Features¶
Backend Selection for MDB/ACCDB Files¶
PyForge CLI v1.0.9 supports multiple backends for MDB/ACCDB conversion:
# Default backend (auto-selected)
pyforge convert database.mdb
# Force subprocess backend (recommended for Databricks Serverless)
pyforge convert database.mdb --backend subprocess
# Force shell backend (for local environments)
pyforge convert database.mdb --backend shell
Backend Selection Logic:
- Serverless environment: automatically uses the subprocess backend
- Local environment: prefers the shell backend, falls back to subprocess
- Manual override: use the --backend flag to force a specific backend (see the sketch below)
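The same selection can be made explicit in a script. A sketch keyed off the IS_SERVERLESS variable (value casing may vary by runtime, so both spellings are checked):
case "${IS_SERVERLESS:-false}" in
  TRUE|true) backend=subprocess ;;
  *)         backend=shell ;;
esac
pyforge convert database.mdb --backend "$backend" --verbose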
Unity Catalog Volume Path Handling¶
# Enable volume path handling for Unity Catalog
pyforge convert /Volumes/catalog/schema/volume/data.mdb --volume-path
# Without volume path flag (may fail on volumes)
pyforge convert /Volumes/catalog/schema/volume/data.mdb # Not recommended
# Volume path with other options
pyforge convert /Volumes/catalog/schema/volume/secure.mdb \
--volume-path --backend subprocess --password "secret" --verbose
Installation in Databricks Serverless¶
# Install PyForge CLI in Databricks notebook
%pip install /Volumes/catalog/schema/pkgs/pyforge_cli-1.0.9-py3-none-any.whl \
--no-cache-dir --quiet --index-url https://pypi.org/simple/ --trusted-host pypi.org
# Restart Python kernel
dbutils.library.restartPython()
# Verify installation
%sh pyforge --version
Environment Detection¶
PyForge CLI automatically detects Databricks Serverless environment using:
# Environment variables checked
IS_SERVERLESS=TRUE
SPARK_CONNECT_MODE_ENABLED=1
DATABRICKS_RUNTIME_VERSION=client.14.3.x-scala2.12
Troubleshooting Databricks Issues¶
# Check environment detection
echo "IS_SERVERLESS: $IS_SERVERLESS"
echo "SPARK_CONNECT_MODE_ENABLED: $SPARK_CONNECT_MODE_ENABLED"
# Test backend availability
pyforge convert test.mdb --backend subprocess --verbose
# Volume path testing (dbutils runs in a Python cell; pyforge in a shell cell)
dbutils.fs.ls("/Volumes/catalog/schema/volume/")
%sh pyforge convert /Volumes/catalog/schema/volume/test.mdb --volume-path --verbose
# Debug Java availability
java -version # Should work in Databricks runtime
See Also¶
- MDF Tools Installer - Complete MDF tools documentation
- Options Matrix - All options organized by converter
- Output Formats - Output format specifications
- Tutorials - Real-world usage examples
- Troubleshooting - Common issues and solutions
- Databricks Integration Guide - Comprehensive Databricks setup and usage