Open lakehouse migration toolkit
Internal tool for migrating from Snowflake / BigQuery to Apache Iceberg on S3, including schema inference, partition mapping, and reconciliation reports.
Tech stack
Python (CLI), PySpark for load jobs, Apache Iceberg on S3; source connectors for Snowflake, BigQuery, Redshift, and Delta Lake.
Problem
Every lakehouse migration project I ran required the same manual steps: extract the schema from the source warehouse, map partition strategies to Iceberg partition specs, and run reconciliation queries after load to validate row counts and key statistics. This setup took 1-2 weeks per project.
What I built
A Python CLI tool that automates the repetitive steps (illustrative sketches of each component follow the list):
Schema extraction: connectors for Snowflake, BigQuery, Redshift, and Delta Lake; outputs a normalised schema manifest (JSON) with column types, constraints, and partition specs.
Partition mapping: rule engine that maps source partition expressions (e.g. Snowflake CLUSTER BY) to Iceberg partition transforms (identity, bucket, truncate, year/month/day/hour).
Load orchestration: Spark job templates (PySpark) generated from the schema manifest, with configurable write mode (snapshot, overwrite, append) and Iceberg table properties.
Reconciliation: after load, runs a configurable set of checks (row count, null rate, min/max per numeric column, sample hash comparison) and produces a pass/fail HTML report.
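To make the manifest format concrete, here is a minimal sketch of a single-table entry. The source names JSON as the manifest format, but the field names shown here (table, columns, partition_spec, and so on) are illustrative assumptions, not the tool's actual manifest schema.

```json
{
  "table": "sales.orders",
  "source": "snowflake",
  "columns": [
    {"name": "order_id", "type": "long", "nullable": false},
    {"name": "customer_id", "type": "long", "nullable": false},
    {"name": "order_ts", "type": "timestamp", "nullable": false},
    {"name": "amount", "type": "decimal(18,2)", "nullable": true}
  ],
  "constraints": {"primary_key": ["order_id"]},
  "partition_spec": [
    {"source_expr": "CLUSTER BY (order_ts)", "iceberg_transform": "day(order_ts)"}
  ]
}
```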
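The partition-mapping rule engine can be pictured as a small lookup from column type (and cardinality hints) to an Iceberg transform. The following is a simplified sketch under assumed rules (time-typed keys map to day(), high-cardinality keys to bucket()); the functions map_partition_expr and parse_cluster_by are hypothetical names, not the tool's API.

```python
import re

def map_partition_expr(column: str, col_type: str, bucket_count: int = 16) -> str:
    """Pick an Iceberg partition transform for a source clustering column."""
    if col_type in ("date", "timestamp"):
        # Time-typed clustering keys usually map to a calendar transform.
        return f"day({column})"
    if col_type in ("int", "long", "string"):
        # High-cardinality keys map to a bucket transform.
        return f"bucket({bucket_count}, {column})"
    # Fall back to identity partitioning for everything else.
    return f"identity({column})"

def parse_cluster_by(ddl_fragment: str) -> list[str]:
    """Extract column names from a Snowflake CLUSTER BY (...) clause."""
    match = re.search(r"CLUSTER BY\s*\((.*?)\)", ddl_fragment, re.IGNORECASE)
    return [c.strip() for c in match.group(1).split(",")] if match else []

# Example: CLUSTER BY (order_ts) on a timestamp column -> day(order_ts)
cols = parse_cluster_by("CLUSTER BY (order_ts)")
print([map_partition_expr(c, "timestamp") for c in cols])  # ['day(order_ts)']
```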
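A trimmed sketch of what a generated load job could look like in snapshot mode, using Spark's DataFrameWriterV2 API for Iceberg. The catalog name lake, the staging path, and the table identifier are placeholders, and the real templates carry more table properties and catalog configuration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import days

spark = (
    SparkSession.builder
    .appName("iceberg-load-orders")
    # Assumes an Iceberg catalog named "lake"; the remaining catalog settings
    # (type, warehouse location) are cluster-specific and omitted here.
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .getOrCreate()
)

df = spark.read.parquet("s3://staging/orders/")  # placeholder staging path

# Snapshot-style write: create or replace the target table with the mapped
# partition transform and table properties taken from the schema manifest.
(
    df.writeTo("lake.sales.orders")
    .using("iceberg")
    .partitionedBy(days("order_ts"))
    .tableProperty("write.format.default", "parquet")
    .createOrReplace()
)
```

An append-mode run against an existing table would end with .append() instead of .createOrReplace().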
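The post-load checks can be sketched as paired aggregations over the source and target tables. This simplified version covers row count, null rate, and min/max for one numeric column; the sample hash comparison and the HTML report generation are omitted, and numeric_profile and reconcile are hypothetical names.

```python
from pyspark.sql import DataFrame, functions as F

def numeric_profile(df: DataFrame, col: str) -> dict:
    """Row count, null rate, and min/max for one numeric column."""
    row = df.agg(
        F.count(F.lit(1)).alias("rows"),
        F.sum(F.when(F.col(col).isNull(), 1).otherwise(0)).alias("nulls"),
        F.min(col).alias("min"),
        F.max(col).alias("max"),
    ).first()
    return {
        "rows": row["rows"],
        "null_rate": row["nulls"] / row["rows"] if row["rows"] else 0.0,
        "min": row["min"],
        "max": row["max"],
    }

def reconcile(source: DataFrame, target: DataFrame, col: str, tol: float = 0.0) -> bool:
    """Pass if the target's profile matches the source's within tolerance."""
    s, t = numeric_profile(source, col), numeric_profile(target, col)
    return (
        s["rows"] == t["rows"]
        and abs(s["null_rate"] - t["null_rate"]) <= tol
        and s["min"] == t["min"]
        and s["max"] == t["max"]
    )
```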
Results
Reduced migration project setup time from 1-2 weeks to 1-2 days across four production migrations. The schema extraction module is now used independently by two other teams for documentation generation.