Excel to Parquet Conversion¶
Convert Excel spreadsheets (.xlsx files) to efficient Parquet format with multi-sheet support and intelligent data processing.
Overview¶
PyForge CLI provides powerful Excel to Parquet conversion capabilities with:
- Multi-sheet processing with automatic detection
- Interactive sheet selection for complex workbooks
- Column matching for combining similar sheets
- Data type preservation and optimization
- Compression options for space efficiency
- Progress tracking with detailed reports
Basic Usage¶
Convert Single Excel File¶
# Convert entire Excel workbook
pyforge convert data.xlsx
# Output: data.parquet (if single sheet) or data/ directory (if multiple sheets)
Convert with Custom Output¶
# Specify output file/directory
pyforge convert sales_data.xlsx reports/sales.parquet
# Convert to specific directory
pyforge convert monthly_data.xlsx output_folder/
Advanced Options¶
Compression Options¶
# Use GZIP compression (recommended)
pyforge convert data.xlsx --compression gzip
# Use Snappy compression (faster)
pyforge convert data.xlsx --compression snappy
# No compression
pyforge convert data.xlsx --compression none
Advanced Processing¶
# Force overwrite existing files
pyforge convert data.xlsx --force
# Verbose output for debugging
pyforge convert data.xlsx --verbose
# Specify output format explicitly
pyforge convert data.xlsx --format parquet
# Custom compression (snappy is default)
pyforge convert data.xlsx --compression gzip
Multi-Sheet Processing¶
Automatic Detection¶
PyForge automatically detects and handles multiple sheets:
# Input: sales_2023.xlsx (3 sheets: Q1, Q2, Q3)
pyforge convert sales_2023.xlsx
# Output: sales_2023/ directory containing:
# ├── Q1.parquet
# ├── Q2.parquet
# ├── Q3.parquet
# └── _summary.xlsx # Optional summary report
Column Matching¶
When sheets have similar structures, PyForge automatically detects and handles them:
# Automatic analysis with intelligent handling
pyforge convert monthly_data.xlsx
# Force combination of matching sheets
pyforge convert monthly_data.xlsx --combine
# Keep sheets separate even if they match
pyforge convert monthly_data.xlsx --separate
Sheet Processing Logic¶
PyForge analyzes your workbook and:

1. Detects sheets with matching column signatures
2. Groups compatible sheets together
3. Prompts for user preference when multiple options exist
4. Converts according to your specified flags or interactive choice
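PyForge's internals aren't shown here, but the column-signature idea behind steps 1 and 2 can be sketched in a few lines of Python (the sheet names and columns below are made up):

```python
from collections import defaultdict

# Hypothetical workbook: sheet name -> column headers
sheets = {
    "Q1": ["date", "region", "sales"],
    "Q2": ["date", "region", "sales"],
    "Q3": ["date", "region", "sales"],
    "Notes": ["author", "comment"],
}

# Sheets with identical column signatures are candidates for combining
groups = defaultdict(list)
for name, columns in sheets.items():
    groups[tuple(columns)].append(name)

for signature, names in groups.items():
    print(signature, "->", names)
```

Here the three quarterly sheets share a signature and group together, while the Notes sheet stands alone, so it would stay a separate output file.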
Output Formats¶
Single Sheet Workbooks¶
# Input: data.xlsx (1 sheet)
pyforge convert data.xlsx
# Output: data.parquet
Multi-Sheet Workbooks¶
Input: quarterly.xlsx (4 sheets)
Output: quarterly/ directory
├── Q1_Data.parquet
├── Q2_Data.parquet
├── Q3_Data.parquet
├── Q4_Data.parquet
└── _summary.xlsx
Combined Output¶
pyforge convert monthly.xlsx --combine
# Output: monthly_combined.parquet (when sheets have matching columns)
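When column signatures match, combining sheets amounts to a row-wise concatenation. The pandas equivalent, with made-up data standing in for two converted sheets, looks like this:

```python
import pandas as pd

# Two sheets with identical columns, as PyForge would have converted them
q1 = pd.DataFrame({"region": ["north", "south"], "sales": ["100", "200"]})
q2 = pd.DataFrame({"region": ["east", "west"], "sales": ["150", "250"]})

# Row-wise concatenation with a single continuous index
combined = pd.concat([q1, q2], ignore_index=True)
```

The combined frame has all rows from both sheets under one shared schema, which is what the single `monthly_combined.parquet` output contains.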
Data Type Handling¶
PyForge converts all Excel data to string format for maximum compatibility:
| Excel Type | Parquet Type | Notes |
|---|---|---|
| Numbers | string | Decimal precision preserved up to 27 places |
| Dates | string | ISO 8601 format (YYYY-MM-DDTHH:MM:SS) |
| Text | string | UTF-8 encoding |
| Formulas | string | Formulas are evaluated; results stored as strings |
| Boolean | string | "True" or "False" string values |
| Merged Cells | string | First cell value is kept; a warning is issued for merged cells |
String-Based Conversion
PyForge CLI currently uses a string-based conversion approach to ensure consistent behavior across all database formats (Excel, MDB, DBF). While this preserves data integrity and precision, you may need to cast types in your analysis tools (pandas, Spark, etc.) if you require native numeric or datetime types.
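For example, in pandas you can cast converted columns back to native types once you know the schema. The column names below are illustrative, and the DataFrame stands in for a PyForge-converted file:

```python
import pandas as pd

# Stand-in for a PyForge-converted file: every column arrives as strings
df = pd.DataFrame({
    "amount": ["19.99", "250.00"],
    "created": ["2023-01-15T09:30:00", "2023-02-01T14:00:00"],
    "note": ["first", "second"],
})

df["amount"] = df["amount"].astype(float)      # numeric strings -> float64
df["created"] = pd.to_datetime(df["created"])  # ISO 8601 strings parse directly
```

Columns you leave alone, like `note` here, simply stay as strings.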
Performance Optimization¶
Large Files¶
# Use efficient compression for large files
pyforge convert large_data.xlsx --compression gzip
# Keep sheets separate to reduce memory usage
pyforge convert workbook.xlsx --separate
# Enable verbose output to monitor progress
pyforge convert huge_file.xlsx --verbose
Memory Management¶
PyForge automatically optimizes memory usage for large files:
- Processes sheets sequentially to minimize memory footprint
- Uses streaming writes for large datasets
- Provides progress feedback for long-running operations
- Automatically handles memory-efficient conversion
Error Handling¶
Common Issues¶
Empty Sheets: Automatically skipped with warning
# Shows warning but continues processing
pyforge convert workbook.xlsx
# Warning: Sheet 'Empty_Sheet' is empty, skipping...
Merged Cells: Handled gracefully
# Warns about data loss potential
pyforge convert report.xlsx
# Warning: Merged cells detected in 'Summary' sheet
Formula Errors: Converts error values to null
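The same normalisation can be reproduced in pandas if you ever need to clean Excel-derived data by hand; the error tokens shown are the standard Excel ones, and the column name is made up:

```python
import pandas as pd

# Raw values as they might appear before error handling
df = pd.DataFrame({"ratio": ["1.5", "#DIV/0!", "2.0", "#N/A"]})

# Map Excel error tokens to missing values
excel_errors = ["#DIV/0!", "#N/A", "#VALUE!", "#REF!", "#NAME?", "#NULL!", "#NUM!"]
df["ratio"] = df["ratio"].replace(excel_errors, pd.NA)
```

Valid values are left untouched; only the error tokens become missing values.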
Validation¶
File Information¶
Output shows:

- Number of sheets
- Row counts per sheet
- Column information
- Data types summary
Verify Conversion¶
# Check converted file
pyforge info output.parquet
# Compare row counts
pyforge validate output.parquet --source spreadsheet.xlsx
Examples¶
Business Reports¶
# Convert financial reports with compression
pyforge convert "Q4_Financial_Report.xlsx" \
--compression gzip \
--verbose
# Output depends on sheet structure:
# - If sheets match: Single combined parquet file
# - If sheets differ: Separate parquet files per sheet
# - Directory structure automatically created
Data Analysis Pipeline¶
# Convert multiple files for analysis
for file in data/*.xlsx; do
    pyforge convert "$file" --compression snappy --verbose
done
# Result: Efficient parquet files ready for pandas/spark
ETL Workflow¶
# Convert with validation
pyforge validate source_data.xlsx && \
pyforge convert source_data.xlsx \
--verbose \
--compression gzip \
&& echo "Conversion successful" \
|| echo "Conversion failed"
Integration¶
Python Integration¶
import pandas as pd
# Read converted parquet file
df = pd.read_parquet('converted_data.parquet')
# Convert string columns to appropriate types
def convert_types(df):
    for col in df.columns:
        # Try numeric first; leave the column as strings if any value fails
        try:
            df[col] = pd.to_numeric(df[col])
            continue
        except (ValueError, TypeError):
            pass
        # Then try datetime; ISO 8601 strings parse directly
        try:
            df[col] = pd.to_datetime(df[col])
        except (ValueError, TypeError):
            pass
    return df
# Apply type conversion
df = convert_types(df)
# Multiple sheets with type conversion
import os
parquet_dir = 'multi_sheet_data/'
sheets = {}
for file in os.listdir(parquet_dir):
    if file.endswith('.parquet'):
        sheet_name = file.replace('.parquet', '')
        df = pd.read_parquet(os.path.join(parquet_dir, file))
        sheets[sheet_name] = convert_types(df)
Spark Integration¶
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType, DoubleType, TimestampType
spark = SparkSession.builder.appName("ExcelData").getOrCreate()
# Read parquet file (all columns will be strings)
df = spark.read.parquet('converted_data.parquet')
# Convert specific columns to appropriate types
df_typed = df.select(
    col("id").cast(IntegerType()).alias("id"),
    col("amount").cast(DoubleType()).alias("amount"),
    col("date").cast(TimestampType()).alias("date"),
    col("description")  # Keep as string
)
df_typed.show()
Troubleshooting¶
Common Solutions¶
Permission Errors: Make sure the output directory is writable, and pass --force to overwrite existing files.
Memory Issues: Keep sheets separate with --separate so they are processed one at a time, and close other memory-intensive applications when converting very large workbooks.
Excel File Issues: Confirm the file is a valid .xlsx file and is not open in Excel or locked by another process.
Debug Mode¶
Run the conversion with --verbose to see detailed progress and diagnostic output:
# Detailed logging for troubleshooting
pyforge convert problem_file.xlsx --verbose
Best Practices¶
- Preview First: Use `pyforge info` to understand your Excel structure
- Use Compression: GZIP provides a good balance of size and speed
- Validate Output: Check row counts and data types after conversion
- Interactive Mode: Use for complex workbooks you haven't seen before
- Batch Processing: Convert multiple files using shell loops or scripts
Performance Notes¶
- Sheet Processing: Sheets are converted sequentially to keep memory usage low
- Memory Efficient: Streams data to avoid loading entire file in memory
- Compression: Significant space savings with GZIP compression
- String Storage: All values are written as strings for cross-format consistency (see Data Type Handling); cast to native types downstream
For more advanced usage, see the CLI Reference for complete option details.