Databricks Serverless Conversion Guide
This comprehensive guide walks you through using PyForge CLI in Databricks Serverless environments to convert a range of file formats, including Microsoft Access databases (.mdb/.accdb), DBF files, Excel spreadsheets, CSV files, and XML documents.
Download Interactive Notebook
📓 Download the Jupyter Notebook - Ready-to-use notebook with all examples and code cells properly formatted for Databricks.
Prerequisites
- Databricks Serverless workspace access
- Unity Catalog volume with read/write permissions (a quick access check is sketched below)
- Sample datasets or your own files to convert
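Before installing anything, you can confirm that your Unity Catalog volume is reachable from the notebook. This optional sanity check is a sketch; replace the catalog, schema, and volume names with your own:
# Optional sanity check: confirm the Unity Catalog volume is accessible
volume_root = "dbfs:/Volumes/catalog/schema/volume"  # placeholder path - replace with your volume
try:
    display(dbutils.fs.ls(volume_root))
except Exception as e:
    print(f"Cannot access {volume_root}: {e}")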
Step 1: Install PyForge CLI
In your Databricks Serverless notebook, install PyForge CLI with the proper PyPI index URL:
# Cell 1: Install PyForge CLI
%pip install pyforge-cli==1.0.9 --no-cache-dir --quiet --index-url https://pypi.org/simple/ --trusted-host pypi.org
# Cell 2: Restart Python kernel
dbutils.library.restartPython()
# Cell 3: Verify installation
import subprocess
result = subprocess.run(['pyforge', '--version'], capture_output=True, text=True)
print(f"PyForge CLI Version: {result.stdout.strip()}")
Always Include PyPI Index URL
The --index-url https://pypi.org/simple/ --trusted-host pypi.org flags are required in Databricks Serverless environments so that pip resolves dependencies against the public PyPI index, including when running behind corporate network proxies.
Step 2: Install Sample Datasets
PyForge CLI includes a curated collection of sample datasets for testing all supported formats:
# Cell 4: Install sample datasets using shell command
%sh
pyforge install sample-datasets --output /dbfs/Volumes/catalog/schema/volume/sample_datasets --verbose
# Cell 5: Define volume path for Python usage
volume_path = "dbfs:/Volumes/catalog/schema/volume/sample_datasets"
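If you would rather stay in Python than use a shell cell, the same installation can be driven through subprocess. This is a sketch that reuses the flags from the shell cell above:
# Alternative: install the sample datasets via subprocess instead of a shell cell
import subprocess
result = subprocess.run([
    'pyforge', 'install', 'sample-datasets',
    '--output', '/dbfs/Volumes/catalog/schema/volume/sample_datasets',
    '--verbose'
], capture_output=True, text=True)
print(result.stdout or result.stderr)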
Step 3: List Available Sample Datasets
After installation, explore the available sample datasets:
# Cell 6: List all sample datasets
import os
# List files in the sample datasets directory
files = dbutils.fs.ls(f"{volume_path}/")
# Organize by file type
file_types = {
'mdb': [],
'accdb': [],
'dbf': [],
'xlsx': [],
'csv': [],
'xml': [],
'pdf': []
}
for file_info in files:
file_name = file_info.name
for ext in file_types.keys():
if file_name.endswith(f'.{ext}'):
file_types[ext].append(file_name)
# Display organized list
print("📁 Available Sample Datasets:\n")
for file_type, files in file_types.items():
if files:
print(f"**{file_type.upper()} Files:**")
for file in files:
print(f" - {file}")
print()
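Note that dbutils.fs.ls is not recursive. If the sample datasets land in subfolders rather than a single flat directory, a small helper like this sketch walks the whole tree:
# Recursively list every file under the volume path
def list_files_recursive(path):
    """Walk a volume path with dbutils.fs.ls and return all file entries."""
    entries = []
    for info in dbutils.fs.ls(path):
        if info.isDir():
            entries.extend(list_files_recursive(info.path))
        else:
            entries.append(info)
    return entries

all_files = list_files_recursive(volume_path)
print(f"Found {len(all_files)} files in total")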
Step 4: Database File Conversion (.mdb/.accdb)
Important: MDB/ACCDB Files Require Subprocess
Due to Java SDK dependencies, MDB/ACCDB files must use subprocess commands instead of %sh magic commands. All other file formats (CSV, XML, Excel, DBF) can use %sh commands.
Convert MDB File (Subprocess Method)
# Cell 7: Convert MDB database file
import subprocess
# Input and output paths
mdb_file = f"{volume_path}/Northwind.mdb"
output_dir = f"{volume_path}/converted/northwind/"
# Convert using subprocess (CORRECT method for Databricks Serverless)
result = subprocess.run([
'pyforge', 'convert',
mdb_file,
output_dir,
'--compression', 'gzip',
'--verbose'
], capture_output=True, text=True)
print("Conversion Output:")
print(result.stdout)
# List converted files
if result.returncode == 0:
converted_files = dbutils.fs.ls(output_dir)
print("\n✅ Converted Tables:")
for file in converted_files:
if file.name.endswith('.parquet'):
print(f" - {file.name}")
else:
print(f"❌ Conversion failed: {result.stderr}")
Convert ACCDB File
# Cell 8: Convert ACCDB database file
accdb_file = f"{volume_path}/AdventureWorks.accdb"
output_dir = f"{volume_path}/converted/adventureworks/"
result = subprocess.run([
'pyforge', 'convert',
accdb_file,
output_dir,
'--tables', 'Customers,Orders,Products', # Convert specific tables
'--compression', 'snappy',
'--verbose'
], capture_output=True, text=True)
print("Conversion Output:")
print(result.stdout)
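The cell above prints the conversion output but does not check the exit code. A small guard like this sketch makes failures explicit instead of silently continuing:
# Fail loudly if the ACCDB conversion did not succeed
if result.returncode != 0:
    raise RuntimeError(f"pyforge convert failed for {accdb_file}:\n{result.stderr}")
print("✅ ACCDB conversion completed")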
Step 5: DBF File Conversion
# Cell 9: Convert DBF file using shell command
%sh
echo "Converting DBF file..."
pyforge convert /dbfs/Volumes/catalog/schema/volume/sample_datasets/customer_data.dbf \
/dbfs/Volumes/catalog/schema/volume/sample_datasets/converted/customer_data.parquet \
--encoding cp1252 \
--verbose
# Cell 10: Read and display converted data
import pandas as pd
output_file = f"{volume_path}/converted/customer_data.parquet"
df = pd.read_parquet(output_file.replace('dbfs:/', '/dbfs/'))
print(f"📊 Sample Data ({len(df)} rows):")
display(df.head())
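DBF files often carry legacy encodings. If cp1252 is not the right source encoding, text columns tend to show garbled characters, and a quick spot check like this sketch helps catch that early:
# Spot-check text columns for encoding artifacts after conversion
text_cols = df.select_dtypes(include='object').columns
for col in text_cols[:5]:
    print(col, '->', df[col].dropna().head(3).tolist())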
Step 6: Excel File Conversion
# Cell 11: Convert Excel file using shell command
%sh
echo "Converting Excel file..."
pyforge convert /dbfs/Volumes/catalog/schema/volume/sample_datasets/financial_report.xlsx \
/dbfs/Volumes/catalog/schema/volume/sample_datasets/converted/financial/ \
--combine \
--compression gzip \
--verbose
# Cell 12: Load and analyze converted data
output_dir = f"{volume_path}/converted/financial/"
parquet_files = dbutils.fs.ls(output_dir)
for file in parquet_files:
if file.name.endswith('.parquet'):
df = pd.read_parquet(file.path.replace('dbfs:/', '/dbfs/'))
print(f"\n📊 {file.name}: {len(df)} rows x {len(df.columns)} columns")
display(df.head(3))
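When a workbook converts to several Parquet files, Spark can also read the whole output directory in one pass, provided the sheets share a compatible schema. A minimal sketch:
# Read every converted sheet from the output directory as a single Spark DataFrame
financial_df = spark.read.parquet(f"{volume_path}/converted/financial/")
print(f"Combined rows: {financial_df.count()}")
display(financial_df.limit(5))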
Step 7: CSV File Conversion
# Cell 13: Convert CSV file using shell command
%sh
echo "Converting CSV file..."
pyforge convert /dbfs/Volumes/catalog/schema/volume/sample_datasets/sales_data.csv \
/dbfs/Volumes/catalog/schema/volume/sample_datasets/converted/sales_data.parquet \
--compression snappy \
--verbose
# Cell 14: Verify conversion
output_file = f"{volume_path}/converted/sales_data.parquet"
df = pd.read_parquet(output_file.replace('dbfs:/', '/dbfs/'))
print(f"✅ Successfully converted {len(df)} rows")
print(f"Columns: {', '.join(df.columns)}")
Step 8: XML File Conversion
# Cell 15: Convert XML file using shell command
%sh
echo "Converting XML file..."
pyforge convert /dbfs/Volumes/catalog/schema/volume/sample_datasets/books_catalog.xml \
/dbfs/Volumes/catalog/schema/volume/sample_datasets/converted/books_catalog.parquet \
--flatten-strategy moderate \
--array-handling expand \
--compression gzip \
--verbose
# Cell 16: Display structured data
output_file = f"{volume_path}/converted/books_catalog.parquet"
df = pd.read_parquet(output_file.replace('dbfs:/', '/dbfs/'))
print(f"📚 Books Catalog: {len(df)} records")
display(df.head())
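To see how the moderate flattening strategy encoded the XML nesting, you can group the output columns by their leading name segment. This inspection sketch assumes flattened column names use an underscore separator, which may differ in your output:
# Group flattened column names by their first segment to visualize the original nesting
from collections import defaultdict
groups = defaultdict(list)
for col in df.columns:
    groups[col.split('_')[0]].append(col)  # assumption: '_' separates nesting levels
for prefix, cols in groups.items():
    print(f"{prefix}: {len(cols)} column(s): {cols[:5]}")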
Step 9: Batch Processing Multiple Files
# Cell 17: Batch convert multiple files
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed
def convert_file(file_path, output_dir):
"""Convert a single file and return status"""
file_name = os.path.basename(file_path)
result = subprocess.run([
'pyforge', 'convert',
file_path,
output_dir,
'--compression', 'gzip',
'--verbose'
], capture_output=True, text=True)
return {
'file': file_name,
'success': result.returncode == 0,
'output': result.stdout,
'error': result.stderr
}
# Files to convert
files_to_convert = [
f"{volume_path}/Northwind.mdb",
f"{volume_path}/financial_report.xlsx",
f"{volume_path}/customer_data.dbf",
f"{volume_path}/sales_data.csv",
f"{volume_path}/books_catalog.xml"
]
output_base = f"{volume_path}/batch_converted/"
# Process files in parallel
results = []
with ThreadPoolExecutor(max_workers=3) as executor:
futures = {executor.submit(convert_file, file, output_base): file
for file in files_to_convert}
for future in as_completed(futures):
result = future.result()
results.append(result)
# Print status
status = "✅" if result['success'] else "❌"
print(f"{status} {result['file']}")
print(f"\n📊 Conversion Summary:")
print(f"Total files: {len(results)}")
print(f"Successful: {sum(1 for r in results if r['success'])}")
print(f"Failed: {sum(1 for r in results if not r['success'])}")
Step 10: Working with Converted Data
# Cell 18: Load and analyze converted data
import pandas as pd
import pyspark.sql.functions as F
# Load converted Northwind customers as Pandas DataFrame
customers_pd = pd.read_parquet(
f'/dbfs/{volume_path.replace("dbfs:/", "")}/converted/northwind/Customers.parquet'
)
print(f"Pandas DataFrame: {len(customers_pd)} customers")
display(customers_pd.head())
# Load as Spark DataFrame for large-scale processing
customers_spark = spark.read.parquet(
f"{volume_path}/converted/northwind/Customers.parquet"
)
# Register as temporary view
customers_spark.createOrReplaceTempView("customers")
# Run SQL analytics
result = spark.sql("""
SELECT
Country,
COUNT(*) as customer_count
FROM customers
GROUP BY Country
ORDER BY customer_count DESC
LIMIT 10
""")
print("\n📊 Customers by Country:")
display(result)
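Aggregated results can also be written back to the volume as Parquet for downstream jobs. A minimal sketch (the analysis/ subfolder is just an example location):
# Persist the aggregation for downstream consumption
result.write.mode("overwrite").parquet(f"{volume_path}/analysis/customers_by_country")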
Troubleshooting
Common Issues and Solutions
1. Installation Failures
# If installation fails, try clearing cache
%pip cache purge
%pip install pyforge-cli==1.0.9 --no-cache-dir --quiet --index-url https://pypi.org/simple/ --trusted-host pypi.org --upgrade
dbutils.library.restartPython()
2. Path Resolution Issues
# Always use the dbfs:/ prefix for Unity Catalog volumes in Python code
correct_path = "dbfs:/Volumes/catalog/schema/volume/file.mdb"    # ✅
incorrect_path = "/Volumes/catalog/schema/volume/file.mdb"       # ❌ missing dbfs:/ prefix
3. Subprocess Command Not Found
# Verify PyForge is in PATH after installation
import subprocess
import sys
# Check Python executable location
print(f"Python: {sys.executable}")
# Find pyforge location
result = subprocess.run(['which', 'pyforge'], capture_output=True, text=True)
print(f"PyForge location: {result.stdout.strip()}")
4. Memory Issues with Large Files
# Process large files with specific tables or chunking
result = subprocess.run([
'pyforge', 'convert',
'large_database.mdb',
'--tables', 'Table1,Table2', # Convert specific tables only
'--compression', 'gzip', # Use compression
'--verbose'
], capture_output=True, text=True)
Best Practices
- Use %sh commands for CSV, XML, Excel, and DBF file conversions
- Use subprocess only for MDB/ACCDB files due to Java SDK dependencies
- Include PyPI index URL in all pip install commands
- Use the dbfs:/ prefix for Unity Catalog volume paths in Python code
- Use the /dbfs/ prefix for shell commands (a helper for switching between the two forms is sketched after this list)
- Restart Python kernel after installing PyForge CLI
- Use compression for large files to save storage
- Verify conversions by checking output and return codes
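The two path conventions above (dbfs:/ in Python and Spark, /dbfs/ in shell commands and pandas) can be captured in one small helper so each volume path is defined only once. A sketch:
# Convert a dbfs:/Volumes/... URI (Python/Spark) into its /dbfs/Volumes/... form (shell/pandas)
def to_fuse_path(volume_uri: str) -> str:
    return volume_uri.replace('dbfs:/', '/dbfs/', 1)

print(to_fuse_path("dbfs:/Volumes/catalog/schema/volume/sample_datasets"))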
Performance Tips
Parallel Processing for Multiple Files
# Use ThreadPoolExecutor for concurrent conversions
from concurrent.futures import ThreadPoolExecutor
# Limit workers to avoid overwhelming the cluster
max_workers = min(3, len(files_to_convert))
Memory-Efficient Processing
# For very large databases, process one table at a time
tables = ['Customers', 'Orders', 'OrderDetails', 'Products']
for table in tables:
result = subprocess.run([
'pyforge', 'convert', 'large_db.mdb',
'--tables', table,
'--compression', 'gzip'
], capture_output=True, text=True)
Optimize Spark Reading
# Read all converted Parquet files in one pass; mergeSchema reconciles minor schema differences between files
df = spark.read.option("mergeSchema", "true").parquet(
f"{volume_path}/converted/data/*.parquet"
)
Next Steps
- Explore the CLI Reference for all available options
- Learn about XML conversion strategies
- Review database conversion details
- Check troubleshooting guide for more solutions
Summary
This guide demonstrated how to:
✅ Install PyForge CLI in Databricks Serverless with proper PyPI configuration
✅ Install and explore sample datasets in Unity Catalog volumes
✅ Convert database files (.mdb/.accdb) using subprocess commands
✅ Process DBF, Excel, CSV, and XML files with various options
✅ Perform batch conversions with parallel processing
✅ Load and analyze converted Parquet files with Pandas and Spark
✅ Handle common issues and optimize performance
Remember: Use subprocess commands for MDB/ACCDB files only. All other formats (CSV, XML, Excel, DBF) can use standard %sh commands!