CSV to Parquet Conversion¶
Convert CSV, TSV, and delimited text files to efficient Parquet format with automatic delimiter and encoding detection.
Overview¶
PyForge CLI provides intelligent CSV to Parquet conversion with:
- Automatic delimiter detection (comma, semicolon, tab, pipe)
- Encoding auto-detection (UTF-8, Latin-1, Windows-1252, UTF-16)
- Header detection and flexible handling
- String-based conversion for data consistency
- Progress tracking with detailed reports
- Compression options for optimal storage
Basic Usage¶
Simple Conversion¶
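The simplest invocation just names the input file; all detection is automatic. A minimal example (assuming the output defaults to the input name with a `.parquet` extension):
# Convert with defaults: delimiter, encoding, and headers are auto-detected
pyforge convert data.csv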
With Custom Output¶
# Specify output file
pyforge convert sales_data.csv reports/sales.parquet
# Convert to specific directory
pyforge convert dataset.csv processed/
Auto-Detection Features¶
Delimiter Detection¶
PyForge automatically detects common delimiters:
# Comma-separated (default)
pyforge convert data.csv
# Automatically detects semicolon
pyforge convert european_data.csv
# Automatically detects tab-separated
pyforge convert data.tsv
# Automatically detects pipe-separated
pyforge convert legacy_data.txt
Supported Delimiters:
- Comma (`,`): Standard CSV format
- Semicolon (`;`): European CSV format
- Tab (`\t`): TSV format
- Pipe (`|`): Legacy database exports
Encoding Detection¶
Automatic encoding detection handles international data:
# UTF-8 files (default)
pyforge convert modern_data.csv
# Latin-1/ISO-8859-1 encoded files
pyforge convert european_legacy.csv
# Windows-1252 encoded files
pyforge convert windows_export.csv
Supported Encodings:
- UTF-8: Modern standard encoding
- Latin-1 (ISO-8859-1): Western European
- Windows-1252: Windows default encoding
- UTF-16: Unicode with BOM
Header Detection¶
Smart header detection:
# Files with headers (automatically detected)
pyforge convert data_with_headers.csv
# Files without headers (auto-generates column names)
pyforge convert raw_data.csv
Advanced Options¶
Compression Options¶
# Use GZIP compression (recommended for storage)
pyforge convert data.csv --compression gzip
# Use Snappy compression (default, faster)
pyforge convert data.csv --compression snappy
# No compression
pyforge convert data.csv --compression none
Processing Options¶
# Force overwrite existing files
pyforge convert data.csv output.parquet --force
# Verbose output with detailed processing info
pyforge convert data.csv --verbose
# Specify output format explicitly
pyforge convert data.csv --format parquet
Data Type Handling¶
PyForge converts all CSV data to string format for maximum compatibility:
| CSV Content | Parquet Type | Notes |
|---|---|---|
| Numbers | string | Decimal precision preserved as text |
| Dates | string | Date format preserved as-is |
| Text | string | UTF-8 encoded |
| Mixed Types | string | No data type inference conflicts |
| Empty Values | string | Preserved as empty strings |
| Special Characters | string | International characters supported |
String-Based Conversion
PyForge CLI uses a string-based conversion approach to ensure consistent behavior across all data formats (Excel, MDB, DBF, CSV). While this preserves data integrity and precision, you may need to cast types in your analysis tools (pandas, Spark, etc.) if you require native numeric or datetime types.
File Format Support¶
Input Formats¶
| Extension | Description | Auto-Detection |
|---|---|---|
| `.csv` | Comma-Separated Values | ✅ Full support |
| `.tsv` | Tab-Separated Values | ✅ Full support |
| `.txt` | Delimited text files | ✅ Full support |
Output Format¶
- Parquet: Columnar storage with compression and efficient analytics support
Error Handling¶
Common Issues and Solutions¶
Encoding Problems:
# PyForge automatically detects encoding
# If issues occur, check verbose output
pyforge convert problematic.csv --verbose
Delimiter Detection Issues:
# Check file content and delimiter detection
pyforge info suspicious.csv
# Use verbose mode to see detection details
pyforge convert file.csv --verbose
Large Files:
# Use compression for large files
pyforge convert huge_dataset.csv --compression gzip
# Monitor progress with verbose output
pyforge convert large_file.csv --verbose
Mixed Encodings:
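Detection chooses a single encoding for the whole file, so data that mixes encodings can come through with garbled characters. One possible pre-processing step (an assumption, using the standard `iconv` utility rather than a PyForge option; filenames are illustrative):
# Normalize a Windows-1252 export to UTF-8 before converting
iconv -f WINDOWS-1252 -t UTF-8 mixed_encoding.csv > mixed_encoding_utf8.csv
pyforge convert mixed_encoding_utf8.csv --verbose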
Validation and Inspection¶
Pre-conversion Analysis¶
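Use `pyforge info` (the same command used for delimiter troubleshooting above) to inspect a file before converting it:
# Inspect detection results and file metadata before conversion
pyforge info sales_data.csv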
Shows:
- File size and estimated row count
- Detected encoding and confidence
- Detected delimiter and quote character
- Header detection status
- File modification date
File Validation¶
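A minimal sketch, assuming `pyforge validate` takes the file path as its only argument (the command is referenced under Best Practices below; see the CLI Reference for exact options):
# Validate file structure and readability before a full conversion
pyforge validate sales_data.csv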
Performance Optimization¶
Large File Processing¶
# Optimize for large CSV files
pyforge convert massive_dataset.csv \
--compression gzip \
--verbose
# Monitor memory usage and processing time
pyforge convert big_file.csv --verbose
Batch Processing¶
# Convert multiple CSV files
for csv_file in data/*.csv; do
echo "Converting: $csv_file"
pyforge convert "$csv_file" \
--compression gzip \
--verbose
done
Examples¶
Business Data Processing¶
# Convert sales data with automatic detection
pyforge convert "Q4_Sales_Report.csv" \
--compression gzip \
--verbose
# Output includes detection details and conversion summary
International Data¶
# Handle European CSV with semicolon delimiters
pyforge convert european_sales.csv \
--verbose
# Automatic detection handles:
# - Semicolon delimiters
# - European encoding (Latin-1/Windows-1252)
# - International characters
Legacy System Migration¶
# Convert old database exports
pyforge convert legacy_export.txt \
--compression gzip \
--force
# Handles various delimiters and encodings automatically
Data Analysis Pipeline¶
# Convert for analysis workflow
pyforge convert raw_data.csv processed_data.parquet \
--compression snappy \
--verbose
# Result: Efficient Parquet file ready for pandas/Spark
Integration Examples¶
Python/Pandas¶
import pandas as pd
# Read converted parquet file (all columns are strings)
df = pd.read_parquet('converted_data.parquet')
# Convert string columns to appropriate types
def convert_csv_types(df):
    for col in df.columns:
        # Try numeric first; the column stays as strings if parsing fails
        # (errors='ignore' is deprecated in recent pandas, so use try/except)
        try:
            df[col] = pd.to_numeric(df[col])
            continue
        except (ValueError, TypeError):
            pass
        # Then try datetime; again leave the column as strings on failure
        try:
            df[col] = pd.to_datetime(df[col])
        except (ValueError, TypeError):
            pass
    return df
# Apply type conversion
df = convert_csv_types(df)
# Now you can perform analysis with proper types
print(f"Records: {len(df)}")
print(f"Columns: {list(df.columns)}")
print(f"Data types after conversion:\n{df.dtypes}")
Spark Integration¶
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import *
spark = SparkSession.builder.appName("CSVData").getOrCreate()
# Read converted parquet file (all columns are strings)
df = spark.read.parquet('converted_data.parquet')
# Convert specific columns to appropriate types
df_typed = df.select(
    col("id").cast(IntegerType()).alias("id"),
    col("amount").cast(DoubleType()).alias("amount"),
    col("date").cast(DateType()).alias("date"),
    col("description")  # Keep as string
)
df_typed.show()
Troubleshooting¶
Common Solutions¶
File Not Recognized:
Check that the file uses one of the supported input extensions (`.csv`, `.tsv`, or `.txt`); other extensions are not treated as delimited text.
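One possible workaround (an assumption based on the supported-extension table above, not a documented feature; `legacy_export.dat` is a hypothetical filename):
# Copy the export to a supported extension, then convert as usual
cp legacy_export.dat legacy_export.txt
pyforge convert legacy_export.txt --verbose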
Detection Failures:
# Use verbose mode to see detection process
pyforge convert file.csv --verbose
# Check file has content and proper structure
Memory Issues:
Very large files can exhaust memory during conversion. Use `--compression gzip` to keep output small and `--verbose` to monitor progress, as shown under Large File Processing above.
Character Encoding Issues:
# Auto-detection usually handles this
# Check verbose output for encoding confidence
pyforge convert file.csv --verbose
Debug Information¶
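Run any conversion with `--verbose` to print the full detection and conversion report; for example (filename is illustrative):
# Verbose output surfaces the debug details listed below
pyforge convert data.csv --verbose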
This shows:
- Encoding detection process and confidence
- Delimiter detection results
- Header detection decision
- Row and column counts
- Conversion statistics
- Processing time
Best Practices¶
- Use Verbose Mode: Always use `--verbose` for important conversions to verify detection accuracy
- Validate First: Use `pyforge info` and `pyforge validate` before converting critical data
- Choose Compression: Use GZIP for storage, Snappy for speed
- Batch Process: Convert multiple files using shell scripts for efficiency
- Verify Output: Check converted data structure and content
- Handle Large Files: Monitor memory usage for very large CSV files
Character Encoding Reference¶
Common CSV encodings by source:
| Source | Encoding | Description |
|---|---|---|
| Modern Systems | UTF-8 | Unicode standard, handles all characters |
| Windows Excel | Windows-1252 | Windows default, Western European |
| European Systems | ISO-8859-1 (Latin-1) | Western European characters |
| Legacy Systems | ASCII | Basic English characters only |
| International | UTF-16 | Unicode with BOM marker |
For complete command options and advanced features, see the CLI Reference.