Databricks Serverless Conversion Guide¶
This guide walks you through using PyForge CLI in Databricks Serverless environments to convert databases (.mdb/.accdb), DBF files, Excel spreadsheets, CSV files, and XML documents.
Download Interactive Notebook
📓 Download the Jupyter Notebook - Ready-to-use notebook with all examples and code cells properly formatted for Databricks.
Prerequisites¶
- Databricks Serverless workspace access
- Unity Catalog volume with read/write permissions
- Sample datasets or your own files to convert
Step 1: Install PyForge CLI¶
In your Databricks Serverless notebook, install PyForge CLI with the proper PyPI index URL:
# Cell 1: Install PyForge CLI
%pip install pyforge-cli==1.0.9 --no-cache-dir --quiet --index-url https://pypi.org/simple/ --trusted-host pypi.org
# Cell 2: Restart Python kernel
dbutils.library.restartPython()
# Cell 3: Verify installation
import subprocess
result = subprocess.run(['pyforge', '--version'], capture_output=True, text=True)
print(f"PyForge CLI Version: {result.stdout.strip()}")
Always Include PyPI Index URL
The --index-url https://pypi.org/simple/ and --trusted-host pypi.org flags are required in Databricks Serverless environments for proper dependency resolution, particularly on corporate networks.
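If you prefer to avoid a subprocess call, the installed version can also be read from package metadata. A minimal sketch, assuming the distribution name is pyforge-cli as pinned in the %pip command above:
# Optional: check the installed version from package metadata
# (assumes the distribution name "pyforge-cli", as used in the %pip install above)
from importlib.metadata import version, PackageNotFoundError
try:
    print(f"PyForge CLI version: {version('pyforge-cli')}")
except PackageNotFoundError:
    print("pyforge-cli is not installed in this Python environment")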
Step 2: Install Sample Datasets¶
PyForge CLI includes a curated collection of sample datasets for testing all supported formats:
# Cell 4: Install sample datasets using shell command
%sh
pyforge install sample-datasets --output /dbfs/Volumes/catalog/schema/volume/sample_datasets --verbose
# Cell 5: Define volume path for Python usage
volume_path = "dbfs:/Volumes/catalog/schema/volume/sample_datasets"
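Before exploring the datasets, a quick existence check catches a mistyped volume path early. A minimal sketch using dbutils (adjust the catalog, schema, and volume names to your workspace):
# Optional: fail fast if the sample datasets directory is missing
try:
    dataset_files = dbutils.fs.ls(volume_path)
    print(f"Found {len(dataset_files)} objects under {volume_path}")
except Exception as e:
    print(f"Sample datasets not found at {volume_path}: {e}")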
Step 3: List Available Sample Datasets¶
After installation, explore the available sample datasets:
# Cell 6: List all sample datasets
import os
# List files in the sample datasets directory
files = dbutils.fs.ls(f"{volume_path}/")
# Organize by file type
file_types = {
    'mdb': [],
    'accdb': [],
    'dbf': [],
    'xlsx': [],
    'csv': [],
    'xml': [],
    'pdf': []
}
for file_info in files:
    file_name = file_info.name
    for ext in file_types.keys():
        if file_name.endswith(f'.{ext}'):
            file_types[ext].append(file_name)
# Display organized list
print("📁 Available Sample Datasets:\n")
for file_type, files in file_types.items():
    if files:
        print(f"**{file_type.upper()} Files:**")
        for file in files:
            print(f"  - {file}")
        print()
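The FileInfo objects returned by dbutils.fs.ls also carry object sizes, so the listing can be extended into a quick per-format size summary. A small sketch reusing the file_types dictionary from the cell above:
# Optional: total size per file type, using the size attribute from dbutils.fs.ls
sizes = {ext: 0 for ext in file_types}
for file_info in dbutils.fs.ls(f"{volume_path}/"):
    for ext in file_types:
        if file_info.name.endswith(f'.{ext}'):
            sizes[ext] += file_info.size
for ext, total in sizes.items():
    if total:
        print(f"{ext.upper()}: {total / 1024 / 1024:.1f} MB")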
Step 4: Database File Conversion (.mdb/.accdb)¶
Important: MDB/ACCDB Files Require Subprocess
Due to Java SDK dependencies, MDB/ACCDB conversions must be run via subprocess calls instead of %sh magic commands. All other file formats (CSV, XML, Excel, DBF) can use %sh commands.
Convert MDB File (Subprocess Method - Correct)¶
# Cell 7: Convert MDB database file
import subprocess
# Input and output paths
mdb_file = f"{volume_path}/Northwind.mdb"
output_dir = f"{volume_path}/converted/northwind/"
# Convert using subprocess (CORRECT method for Databricks Serverless)
result = subprocess.run([
    'pyforge', 'convert',
    mdb_file,
    output_dir,
    '--compression', 'gzip',
    '--verbose'
], capture_output=True, text=True)
print("Conversion Output:")
print(result.stdout)
# List converted files
if result.returncode == 0:
    converted_files = dbutils.fs.ls(output_dir)
    print("\n✅ Converted Tables:")
    for file in converted_files:
        if file.name.endswith('.parquet'):
            print(f"  - {file.name}")
else:
    print(f"❌ Conversion failed: {result.stderr}")
Convert ACCDB File¶
# Cell 8: Convert ACCDB database file
accdb_file = f"{volume_path}/AdventureWorks.accdb"
output_dir = f"{volume_path}/converted/adventureworks/"
result = subprocess.run([
    'pyforge', 'convert',
    accdb_file,
    output_dir,
    '--tables', 'Customers,Orders,Products',  # Convert specific tables
    '--compression', 'snappy',
    '--verbose'
], capture_output=True, text=True)
print("Conversion Output:")
print(result.stdout)
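Both database cells repeat the same subprocess pattern. If you convert many databases, it can help to collect the boilerplate into a small helper; the sketch below uses a hypothetical run_pyforge_convert function (not part of PyForge CLI), while the rest of this guide keeps explicit subprocess calls so each cell stays self-contained:
# Optional helper: wrap the subprocess call and surface errors clearly
# (hypothetical convenience function, not part of PyForge CLI itself)
import subprocess

def run_pyforge_convert(input_path, output_path, *extra_args):
    """Run 'pyforge convert' and return True on success."""
    cmd = ['pyforge', 'convert', input_path, output_path, *extra_args]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"❌ {input_path}: {result.stderr}")
        return False
    print(f"✅ {input_path}")
    return True

# Example usage:
# run_pyforge_convert(accdb_file, output_dir, '--compression', 'snappy')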
Step 5: DBF File Conversion¶
# Cell 9: Convert DBF file using shell command
%sh
echo "Converting DBF file..."
pyforge convert /dbfs/Volumes/catalog/schema/volume/sample_datasets/customer_data.dbf \
    /dbfs/Volumes/catalog/schema/volume/sample_datasets/converted/customer_data.parquet \
    --encoding cp1252 \
    --verbose
# Cell 10: Read and display converted data
import pandas as pd
output_file = f"{volume_path}/converted/customer_data.parquet"
df = pd.read_parquet(output_file.replace('dbfs:/', '/dbfs/'))
print(f"📊 Sample Data ({len(df)} rows):")
display(df.head())
Step 6: Excel File Conversion¶
# Cell 11: Convert Excel file using shell command
%sh
echo "Converting Excel file..."
pyforge convert /dbfs/Volumes/catalog/schema/volume/sample_datasets/financial_report.xlsx \
    /dbfs/Volumes/catalog/schema/volume/sample_datasets/converted/financial/ \
    --combine \
    --compression gzip \
    --verbose
# Cell 12: Load and analyze converted data
output_dir = f"{volume_path}/converted/financial/"
parquet_files = dbutils.fs.ls(output_dir)
for file in parquet_files:
    if file.name.endswith('.parquet'):
        df = pd.read_parquet(file.path.replace('dbfs:/', '/dbfs/'))
        print(f"\n📊 {file.name}: {len(df)} rows x {len(df.columns)} columns")
        display(df.head(3))
Step 7: CSV File Conversion¶
# Cell 13: Convert CSV file using shell command
%sh
echo "Converting CSV file..."
pyforge convert /dbfs/Volumes/catalog/schema/volume/sample_datasets/sales_data.csv \
    /dbfs/Volumes/catalog/schema/volume/sample_datasets/converted/sales_data.parquet \
    --compression snappy \
    --verbose
# Cell 14: Verify conversion
output_file = f"{volume_path}/converted/sales_data.parquet"
df = pd.read_parquet(output_file.replace('dbfs:/', '/dbfs/'))
print(f"✅ Successfully converted {len(df)} rows")
print(f"Columns: {', '.join(df.columns)}")
Step 8: XML File Conversion¶
# Cell 15: Convert XML file using shell command
%sh
echo "Converting XML file..."
pyforge convert /dbfs/Volumes/catalog/schema/volume/sample_datasets/books_catalog.xml \
    /dbfs/Volumes/catalog/schema/volume/sample_datasets/converted/books_catalog.parquet \
    --flatten-strategy moderate \
    --array-handling expand \
    --compression gzip \
    --verbose
# Cell 16: Display structured data
output_file = f"{volume_path}/converted/books_catalog.parquet"
df = pd.read_parquet(output_file.replace('dbfs:/', '/dbfs/'))
print(f"📚 Books Catalog: {len(df)} records")
display(df.head())
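Because the moderate flatten strategy turns nested XML elements into columns, it is worth inspecting the resulting column names and types before building queries on them:
# Optional: inspect how nested elements were flattened into columns
print(f"Columns ({len(df.columns)}):")
for col in df.columns:
    print(f"  - {col}: {df[col].dtype}")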
Step 9: Batch Processing Multiple Files¶
# Cell 17: Batch convert multiple files
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed
def convert_file(file_path, output_dir):
    """Convert a single file and return status"""
    file_name = os.path.basename(file_path)
    result = subprocess.run([
        'pyforge', 'convert',
        file_path,
        output_dir,
        '--compression', 'gzip',
        '--verbose'
    ], capture_output=True, text=True)
    return {
        'file': file_name,
        'success': result.returncode == 0,
        'output': result.stdout,
        'error': result.stderr
    }
# Files to convert
files_to_convert = [
    f"{volume_path}/Northwind.mdb",
    f"{volume_path}/financial_report.xlsx",
    f"{volume_path}/customer_data.dbf",
    f"{volume_path}/sales_data.csv",
    f"{volume_path}/books_catalog.xml"
]
output_base = f"{volume_path}/batch_converted/"
# Process files in parallel
results = []
with ThreadPoolExecutor(max_workers=3) as executor:
    futures = {executor.submit(convert_file, file, output_base): file 
               for file in files_to_convert}
    for future in as_completed(futures):
        result = future.result()
        results.append(result)
        # Print status
        status = "✅" if result['success'] else "❌"
        print(f"{status} {result['file']}")
print(f"\n📊 Conversion Summary:")
print(f"Total files: {len(results)}")
print(f"Successful: {sum(1 for r in results if r['success'])}")
print(f"Failed: {sum(1 for r in results if not r['success'])}")
Step 10: Working with Converted Data¶
# Cell 18: Load and analyze converted data
import pandas as pd
import pyspark.sql.functions as F
# Load converted Northwind customers as Pandas DataFrame
customers_pd = pd.read_parquet(
    f'/dbfs/{volume_path.replace("dbfs:/", "")}/converted/northwind/Customers.parquet'
)
print(f"Pandas DataFrame: {len(customers_pd)} customers")
display(customers_pd.head())
# Load as Spark DataFrame for large-scale processing
customers_spark = spark.read.parquet(
    f"{volume_path}/converted/northwind/Customers.parquet"
)
# Register as temporary view
customers_spark.createOrReplaceTempView("customers")
# Run SQL analytics
result = spark.sql("""
    SELECT 
        Country,
        COUNT(*) as customer_count
    FROM customers
    GROUP BY Country
    ORDER BY customer_count DESC
    LIMIT 10
""")
print("\n📊 Customers by Country:")
display(result)
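The aggregated result can also be written back to the volume as Parquet for downstream jobs. A minimal sketch; the analytics/ subdirectory is just an example location, not something PyForge creates:
# Optional: persist the aggregation back to the volume as Parquet
result.write.mode("overwrite").parquet(f"{volume_path}/analytics/customers_by_country")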
Troubleshooting¶
Common Issues and Solutions¶
1. Installation Failures¶
# If installation fails, try clearing cache
%pip cache purge
%pip install pyforge-cli==1.0.9 --no-cache-dir --quiet --index-url https://pypi.org/simple/ --trusted-host pypi.org --upgrade
dbutils.library.restartPython()
2. Path Resolution Issues¶
# Always use the dbfs:/ prefix for Unity Catalog volume paths in Python code
correct_path = "dbfs:/Volumes/catalog/schema/volume/file.mdb"    # ✅ correct
incorrect_path = "/Volumes/catalog/schema/volume/file.mdb"       # ❌ missing prefix
3. Subprocess Command Not Found¶
# Verify PyForge is in PATH after installation
import subprocess
import sys
# Check Python executable location
print(f"Python: {sys.executable}")
# Find pyforge location
result = subprocess.run(['which', 'pyforge'], capture_output=True, text=True)
print(f"PyForge location: {result.stdout.strip()}")
4. Memory Issues with Large Files¶
# Process large files with specific tables or chunking
result = subprocess.run([
    'pyforge', 'convert',
    'large_database.mdb',
    '--tables', 'Table1,Table2',  # Convert specific tables only
    '--compression', 'gzip',       # Use compression
    '--verbose'
], capture_output=True, text=True)
Best Practices¶
- Use %sh commands for CSV, XML, Excel, and DBF file conversions
- Use subprocess only for MDB/ACCDB files due to Java SDK dependencies
- Include the PyPI index URL in all pip install commands
- Use the dbfs:/ prefix for Unity Catalog volume paths in Python code
- Use the /dbfs/ prefix for shell commands
- Restart the Python kernel after installing PyForge CLI
- Use compression for large files to save storage
- Verify conversions by checking output and return codes
Performance Tips¶
Parallel Processing for Multiple Files¶
# Use ThreadPoolExecutor for concurrent conversions
from concurrent.futures import ThreadPoolExecutor
# Limit workers to avoid overwhelming the cluster
max_workers = min(3, len(files_to_convert))
Memory-Efficient Processing¶
# For very large databases, process one table at a time
tables = ['Customers', 'Orders', 'OrderDetails', 'Products']
for table in tables:
    result = subprocess.run([
        'pyforge', 'convert', 'large_db.mdb',
        '--tables', table,
        '--compression', 'gzip'
    ], capture_output=True, text=True)
Optimize Spark Reading¶
# Merge schemas when reading a directory of converted Parquet files
df = spark.read.option("mergeSchema", "true").parquet(
    f"{volume_path}/converted/data/*.parquet"
)
Next Steps¶
- Explore the CLI Reference for all available options
- Learn about XML conversion strategies
- Review database conversion details
- Check the troubleshooting guide for more solutions
Summary¶
This guide demonstrated how to:
✅ Install PyForge CLI in Databricks Serverless with proper PyPI configuration
✅ Install and explore sample datasets in Unity Catalog volumes
✅ Convert database files (.mdb/.accdb) using subprocess commands
✅ Process DBF, Excel, CSV, and XML files with various options
✅ Perform batch conversions with parallel processing
✅ Load and analyze converted Parquet files with Pandas and Spark
✅ Handle common issues and optimize performance  
Remember: Use subprocess commands for MDB/ACCDB files only. All other formats (CSV, XML, Excel, DBF) can use standard %sh commands!