PyForge CLI Sample Datasets¶
A comprehensive collection of sample datasets for testing PyForge CLI data processing capabilities across multiple file formats.
Overview¶
The PyForge CLI Sample Datasets collection provides 19 curated datasets across 7 formats for testing data conversion, processing, and analysis workflows. All datasets can be downloaded automatically with the PyForge CLI install command.
Installation¶
Example:
# Install to current directory
pyforge install sample-datasets .
# Install to specific directory
pyforge install sample-datasets /path/to/datasets
# Install to data folder
pyforge install sample-datasets ./data
Dataset Categories¶
Size Categories¶
- Small: <100MB - Ideal for quick testing and development
 - Medium: 100MB-1GB - Suitable for performance testing
 - Large: >1GB - For stress testing and production validation
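
The thresholds above can be expressed as a small helper. This is a minimal sketch, not part of PyForge CLI: the function name is hypothetical, and it assumes decimal units (1MB = 1,000,000 bytes), which the source does not specify.

```python
MB = 1000 ** 2  # assumption: decimal megabytes; the source does not say MB vs MiB
GB = 1000 ** 3


def size_category(size_bytes: int) -> str:
    """Map a file size in bytes to the collection's size category."""
    if size_bytes < 100 * MB:
        return "small"   # <100MB
    if size_bytes <= 1 * GB:
        return "medium"  # 100MB-1GB
    return "large"       # >1GB
```

Under these assumptions, a 59.8KB file falls in the small category and a 568.8MB file in the medium category.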
 
Format Coverage¶
- PDF: Government documents and technical reports
 - Excel: Business data with multi-sheet structures
 - XML: API responses and structured data
 - Access: Database files (.mdb/.accdb)
 - DBF: Geographic and legacy database formats
 - MDF: SQL Server database files
 - CSV: Analytics and machine learning datasets
 
Available Datasets¶
📄 PDF Files (1 dataset)¶
NIST Cybersecurity Framework¶
- Size: 1.0MB (Small)
 - Format: PDF
 - License: Public Domain (US Government)
 - Description: NIST Cybersecurity Framework guidelines
 - Use Cases: Technical document analysis, security compliance
 - Download: Direct HTTP
 - Status: ✅ Working
 
📊 Excel Files (3 datasets)¶
Global Superstore¶
- Size: 17.4MB (Small)
 - Format: Excel
 - License: Other (specified)
 - Description: Global e-commerce sales data 2011-2014
 - Use Cases: International data processing, time series analysis
 - Download: Kaggle API (shekpaul/global-superstore)
 - Status: ✅ Working
 
COVID Dashboard¶
- Size: 250.4KB (Small)
 - Format: Excel
 - License: Public
 - Description: Interactive COVID-19 analysis with embedded charts
 - Use Cases: Dashboard processing, chart extraction, health data
 - Download: Kaggle API (suhj22/covid19-excel-dataset-with-interactive-dashboard)
 - Status: ✅ Working
 
Financial Sample¶
- Size: 81.5KB (Small)
 - Format: Excel
 - License: Public
 - Description: Financial statements and analysis
 - Use Cases: Financial data processing, accounting workflows
 - Download: Kaggle API (konstantinognev/financial-samplexlsx)
 - Status: ✅ Working
 
🔗 XML Files (1 dataset)¶
USPTO Patent Data¶
- Size: 568.8MB (Medium)
 - Format: XML
 - License: CC Public Domain Mark 1.0
 - Description: Full-text patent grants from USPTO
 - Use Cases: Government XML processing, legal documents, complex structures
 - Download: Kaggle API (uspto/patent-grant-full-text)
 - Status: ✅ Working
 
🗃️ Access Database Files (3 datasets)¶
Northwind 2007 (VB.NET)¶
- Size: 3.5MB (Small)
 - Format: ACCDB (Access 2007+)
 - License: Educational/Sample Use
 - Description: Classic Northwind sample database used in VB.NET examples
 - Use Cases: Database connectivity, business data modeling, relational data
 - Download: Direct HTTP (GitHub: ssmith1975/samples-vb-net)
 - Status: ✅ Working
 
Sample Database (Dibi)¶
- Size: 284KB (Small)
 - Format: MDB (Access 97/2000/2003)
 - License: Open Source
 - Description: Small sample database for testing database abstraction layer
 - Use Cases: Legacy database testing, compatibility validation
 - Download: Direct HTTP (GitHub: dg/dibi)
 - Status: ✅ Working
 
Sakila (Access Port)¶
- Size: 3.8MB (Small)
 - Format: MDB (Access 97/2000/2003)
 - License: BSD License
 - Description: MySQL Sakila sample database ported to Access format
 - Use Cases: Cross-platform database testing, movie rental business model
 - Download: Direct HTTP (GitHub: ozzymcduff/sakila-sample-database-ports)
 - Status: ✅ Working
 
📋 DBF Files (3 datasets)¶
Census TIGER Sample¶
- Size: 175KB (Small)
 - Format: DBF (dBase)
 - License: Public Domain (US Government)
 - Description: US Census TIGER geographic place data
 - Use Cases: Geographic data processing, legacy format support
 - Download: Direct HTTP (ZIP extraction)
 - Status: ✅ Working
 
Property Sample¶
- Size: 75MB (Small)
 - Format: DBF (dBase)
 - License: Public Domain (US Government)
 - Description: US Census tabulation blocks geographic data
 - Use Cases: Large DBF handling, geographic analysis
 - Download: Direct HTTP (ZIP extraction)
 - Status: ✅ Working
 
County Geographic¶
- Size: 970KB (Small)
 - Format: DBF (dBase)
 - License: Public Domain (US Government)
 - Description: US Census county geographic boundaries
 - Use Cases: Administrative boundaries, county-level analysis
 - Download: Direct HTTP (ZIP extraction)
 - Status: ✅ Working
 
🗄️ MDF Files (2 datasets)¶
AdventureWorks 2012 OLTP LT¶
- Size: 5.9MB (Small)
 - Format: MDF (SQL Server)
 - License: Microsoft Sample Code License
 - Description: Microsoft AdventureWorks OLTP lightweight sample database
 - Use Cases: SQL Server testing, OLTP processing, business applications
 - Download: Direct HTTP (Microsoft GitHub)
 - Status: ✅ Working
 
AdventureWorks 2012 DW¶
- Size: 201.2MB (Medium)
 - Format: MDF (SQL Server)
 - License: Microsoft Sample Code License
 - Description: Microsoft AdventureWorks Data Warehouse sample database
 - Use Cases: Data warehouse testing, OLAP processing, analytics
 - Download: Direct HTTP (Microsoft GitHub)
 - Status: ✅ Working
 
📈 CSV Files (5 datasets)¶
Titanic Dataset¶
- Size: 59.8KB (Small)
 - Format: CSV
 - License: Public Domain
 - Description: Classic passenger survival dataset
 - Use Cases: Machine learning, classification problems, missing values
 - Download: Kaggle API (yasserh/titanic-dataset)
 - Status: ✅ Working
 
Wine Quality¶
- Size: 76.2KB (Small)
 - Format: CSV
 - License: Public Domain
 - Description: Chemical properties and quality ratings
 - Use Cases: Scientific data, regression analysis, quality prediction
 - Download: Kaggle API (yasserh/wine-quality-dataset)
 - Status: ✅ Working
 
UK E-Commerce Data¶
- Size: 43.5MB (Small)
 - Format: CSV
 - License: Public Domain
 - Description: UK online retail transactions
 - Use Cases: E-commerce analysis, international data, business transactions
 - Download: Kaggle API (carrie1/ecommerce-data)
 - Status: ✅ Working
 
Credit Card Fraud¶
- Size: 143.8MB (Medium)
 - Format: CSV
 - License: Open Database License
 - Description: European credit card fraud detection dataset
 - Use Cases: Fraud detection, imbalanced datasets, financial security
 - Download: Kaggle API (mlg-ulb/creditcardfraud)
 - Status: ✅ Working
 
PaySim Financial¶
- Size: 470.7MB (Medium)
 - Format: CSV
 - License: CC BY-SA 4.0
 - Description: Synthetic mobile money transactions
 - Use Cases: Financial simulation, large dataset processing, fraud detection
 - Download: Kaggle API (ealaxi/paysim1)
 - Status: ✅ Working
 
Download Methods¶
Direct HTTP Downloads (9 datasets - 47%)¶
Direct downloads from reliable sources requiring no authentication:
- Government websites (Census, NIST)
 - GitHub repositories (Microsoft, open source projects)
Kaggle API Downloads (10 datasets - 53%)¶
Programmatic access through Kaggle API:
- Requires Kaggle account and API token
 - Automatic authentication handling
 - Community datasets with clear licensing
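
Before installing, it can help to confirm that a Kaggle API token is in place. The sketch below checks the locations conventionally used by the official Kaggle client: the KAGGLE_USERNAME and KAGGLE_KEY environment variables, or a kaggle.json file in ~/.kaggle (or in the directory named by KAGGLE_CONFIG_DIR). How PyForge CLI itself locates the token is not stated in this document, so treat this as an independent check; the function name is hypothetical.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional


def kaggle_credentials_present(
    env: Optional[Mapping[str, str]] = None,
    config_dir: Optional[Path] = None,
) -> bool:
    """Return True if Kaggle credentials are found in the environment or kaggle.json."""
    env = os.environ if env is None else env
    # Option 1: both environment variables are set.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True
    # Option 2: kaggle.json exists and contains both fields.
    if config_dir is None:
        config_dir = Path(env.get("KAGGLE_CONFIG_DIR", Path.home() / ".kaggle"))
    token_file = Path(config_dir) / "kaggle.json"
    try:
        data = json.loads(token_file.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Missing, unreadable, or malformed token file.
        return False
    return isinstance(data, dict) and bool(data.get("username")) and bool(data.get("key"))
```

Calling kaggle_credentials_present() with no arguments inspects the current environment and home directory.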
License Information¶
Public Domain (11 datasets)¶
- US Government data (Census, NIST, DOD)
 - Community contributions
 - No usage restrictions
 
Open Source Licenses (6 datasets)¶
- MIT, BSD, Apache licenses
 - Attribution required
 - Commercial use allowed
 
Educational/Sample Use (4 datasets)¶
- Microsoft sample databases
 - Educational projects
 - Learning and development purposes
 
Creative Commons (2 datasets)¶
- CC0, CC BY-SA licenses
 - Open access with attribution
 
Technical Specifications¶
File Organization¶
sample-datasets/
├── pdf/
│   ├── small/
│   ├── medium/
│   └── large/
├── excel/
│   ├── small/
│   ├── medium/
│   └── large/
├── xml/
│   ├── small/
│   ├── medium/
│   └── large/
├── access/
│   ├── small/
│   ├── medium/
│   └── large/
├── dbf/
│   ├── small/
│   ├── medium/
│   └── large/
├── mdf/
│   ├── small/
│   ├── medium/
│   └── large/
├── csv/
│   ├── small/
│   ├── medium/
│   └── large/
└── metadata/
    ├── manifest.json
    ├── checksums.sha256
    └── download_results.json
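
Given the layout above, a short script can report how many files landed in each format and size folder after installation. This is a sketch under the assumption that the installed tree matches the listing exactly; the function name is hypothetical and not part of PyForge CLI.

```python
from collections import Counter
from pathlib import Path

FORMATS = ("pdf", "excel", "xml", "access", "dbf", "mdf", "csv")
SIZES = ("small", "medium", "large")


def count_datasets(root: Path) -> Counter:
    """Count regular files under each <format>/<size>/ directory of the collection."""
    counts: Counter = Counter()
    for fmt in FORMATS:
        for size in SIZES:
            folder = root / fmt / size
            if folder.is_dir():
                # rglob also picks up files inside extracted subfolders.
                counts[(fmt, size)] = sum(1 for p in folder.rglob("*") if p.is_file())
    return counts
```

For example, count_datasets(Path("./data"))[("csv", "small")] gives the number of files under ./data/csv/small/, and missing folders simply count as zero.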
Metadata Standards¶
- Source Attribution: Original URL, license, collection date
 - File Characteristics: Size, format version, encoding
 - Testing Properties: Complexity level, special features
 - Quality Metrics: Validation status, integrity checks
 
Integrity Verification¶
- SHA256 checksums for all files
 - Download validation and retry logic
 - File corruption detection
 - Source availability monitoring
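
The SHA256 check can also be reproduced by hand. The sketch below assumes that metadata/checksums.sha256 uses the common sha256sum layout (one "<hex digest>  <relative path>" entry per line, with paths relative to the collection root); the actual file format is not documented here, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large datasets are never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(root: Path, checksum_file: Path) -> List[str]:
    """Return the relative paths that are missing or whose digest does not match."""
    failures = []
    for line in checksum_file.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, _, rel_path = line.strip().partition(" ")
        rel_path = rel_path.lstrip(" *")  # sha256sum marks binary mode with '*'
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected.lower():
            failures.append(rel_path)
    return failures
```

An empty list from verify_checksums(Path("./data"), Path("./data/metadata/checksums.sha256")) would indicate that every listed file is present and intact.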
 
Usage Examples¶
Basic Data Processing¶
# Download all datasets
pyforge install sample-datasets ./data
# Process PDF files
pyforge convert ./data/pdf/small/ --output ./processed/
# Analyze Excel files
pyforge convert ./data/excel/ --format parquet
# Handle large CSV files
pyforge convert ./data/csv/large/ --streaming
Format-Specific Testing¶
# Test database connectivity
pyforge connect ./data/access/small/Northwind_2007_VBNet.accdb
# Process geographic data
pyforge convert ./data/dbf/ --projection WGS84
# Extract XML elements
pyforge convert ./data/xml/medium/ --xpath "//item/title"
Performance Benchmarking¶
# Small file performance
pyforge benchmark ./data/*/small/
# Large file stress testing
pyforge benchmark ./data/csv/large/ --memory-limit 1GB
# Format comparison
pyforge benchmark ./data/ --compare-formats
Troubleshooting¶
Common Issues¶
Download Failures¶
- SSL Certificate Issues: Some government sites may have certificate problems
 - Kaggle Authentication: Ensure API token is properly configured
 - Network Timeouts: Large files may require a stable internet connection
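
When a download has to be repeated by hand after a timeout, a retry loop with exponential backoff is a common pattern. The helper below is a generic sketch, not PyForge CLI's own retry logic; its name and parameters are hypothetical.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation until it succeeds, waiting base_delay * 2**n between tries."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

Because urllib.error.URLError and socket timeouts are subclasses of OSError, wrapping a call such as urllib.request.urlopen(url, timeout=30).read() in a lambda and passing it to retry covers the usual network failures.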
 
File Access Problems¶
- Permissions: Ensure write access to target directory
 - Disk Space: Large datasets require sufficient storage
 - Format Support: Verify PyForge CLI format compatibility
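
The first two checks can be run before installing. This sketch uses only the Python standard library; the function name is hypothetical, and the required size is left to the caller.

```python
import os
import shutil
from pathlib import Path
from typing import List


def preflight(target: Path, required_bytes: int) -> List[str]:
    """Return a list of problems that would block installing into target."""
    # Walk up to the nearest existing ancestor, since target may not exist yet.
    probe = target.resolve()
    while not probe.exists() and probe != probe.parent:
        probe = probe.parent
    problems = []
    if not os.access(probe, os.W_OK):
        problems.append(f"no write access to {probe}")
    free = shutil.disk_usage(probe).free
    if free < required_bytes:
        problems.append(f"need {required_bytes} bytes, only {free} free on {probe}")
    return problems
```

For the full collection, the uncompressed figure under Total Collection Size suggests a call along the lines of preflight(Path("./data"), int(2.8e9)); an empty list means neither problem was detected.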
 
Performance Issues¶
- Memory Usage: Large files may require streaming processing
 - Processing Time: Complex formats take longer to convert
 - Concurrent Access: Multiple processes may impact performance
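
To illustrate what streaming processing means for the memory issue above, the sketch below reads a CSV file in fixed-size batches with the standard csv module, so only one batch is held in memory at a time. It is an independent illustration and makes no claim about how PyForge CLI implements streaming; the function name and default batch size are hypothetical.

```python
import csv
from pathlib import Path
from typing import Dict, Iterator, List


def iter_csv_chunks(path: Path, chunk_rows: int = 10_000) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of at most chunk_rows rows so the whole file is never in memory."""
    with path.open(newline="", encoding="utf-8") as fh:
        chunk: List[Dict[str, str]] = []
        for row in csv.DictReader(fh):
            chunk.append(row)
            if len(chunk) >= chunk_rows:
                yield chunk
                chunk = []
        if chunk:
            yield chunk  # final partial batch
```

A caller can then aggregate per batch, for example sum(len(c) for c in iter_csv_chunks(path)) to count rows without loading the file.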
 
Support Resources¶
- Documentation: PyForge CLI Docs
 - Issue Tracking: GitHub Issues
 - Community: Discussions
 
Statistics¶
Success Rates¶
- Overall: 19/19 datasets working (100%)
 - PDF: 1/1 working (100%)
 - Excel: 3/3 working (100%)
 - XML: 1/1 working (100%)
 - Access: 3/3 working (100%)
 - DBF: 3/3 working (100%)
 - MDF: 2/2 working (100%)
 - CSV: 6/6 working (100%)
 
Size Distribution¶
- Small (<100MB): 13 datasets (68%)
 - Medium (100MB-1GB): 6 datasets (32%)
 - Large (>1GB): 0 datasets (0%)
 
Total Collection Size¶
- Compressed: ~1.5GB
 - Uncompressed: ~2.8GB
 - Average per dataset: ~130MB
 
Last updated: 2025-06-24
Version: 1.0.0
PyForge CLI Sample Datasets Collection