Open lakehouse migration toolkit
Internal tool for migrating from Snowflake / BigQuery to Apache Iceberg on S3, including schema inference, partition mapping, and reconciliation reports.
Tech stack
Python (CLI), PySpark for load jobs, Apache Iceberg on S3; source connectors for Snowflake, BigQuery, Redshift, and Delta Lake.
Problem
Every lakehouse migration project I ran required the same manual steps: extract the schema from the source warehouse, map partition strategies to Iceberg partition specs, and run reconciliation queries after load to validate row counts and key statistics. This setup took 1-2 weeks per project.
What I built
A Python CLI tool that automates the repetitive steps (illustrative sketches of each component follow the list):
Schema extraction: connectors for Snowflake, BigQuery, Redshift, and Delta Lake; outputs a normalised schema manifest (JSON) with column types, constraints, and partition specs.
Partition mapping: rule engine that maps source partition expressions (e.g. Snowflake CLUSTER BY) to Iceberg partition transforms (identity, bucket, truncate, year/month/day/hour).
Load orchestration: Spark job templates (PySpark) generated from the schema manifest, with configurable write mode (snapshot, overwrite, append) and Iceberg table properties.
Reconciliation: after load, runs a configurable set of checks (row count, null rate, min/max per numeric column, sample hash comparison) and produces a pass/fail HTML report.
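To make the manifest format concrete, here is a minimal sketch of a single-table entry. The source names JSON as the manifest format, but the field names shown here (table, columns, partition_spec, and so on) are illustrative assumptions, not the tool's actual manifest schema.

```json
{
  "table": "sales.orders",
  "source": "snowflake",
  "columns": [
    {"name": "order_id", "type": "long", "nullable": false},
    {"name": "customer_id", "type": "long", "nullable": false},
    {"name": "order_ts", "type": "timestamp", "nullable": false},
    {"name": "amount", "type": "decimal(18,2)", "nullable": true}
  ],
  "constraints": {"primary_key": ["order_id"]},
  "partition_spec": [
    {"source_expr": "CLUSTER BY (order_ts)", "iceberg_transform": "day(order_ts)"}
  ]
}
```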
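The partition-mapping rule engine can be pictured as a small lookup from column type (and cardinality hints) to an Iceberg transform. The following is a simplified sketch under assumed rules (time-typed keys map to day(), high-cardinality keys to bucket()); the functions map_partition_expr and parse_cluster_by are hypothetical names, not the tool's API.

```python
import re

def map_partition_expr(column: str, col_type: str, bucket_count: int = 16) -> str:
    """Pick an Iceberg partition transform for a source clustering column."""
    if col_type in ("date", "timestamp"):
        # Time-typed clustering keys usually map to a calendar transform.
        return f"day({column})"
    if col_type in ("int", "long", "string"):
        # High-cardinality keys map to a bucket transform.
        return f"bucket({bucket_count}, {column})"
    # Fall back to identity partitioning for everything else.
    return f"identity({column})"

def parse_cluster_by(ddl_fragment: str) -> list[str]:
    """Extract column names from a Snowflake CLUSTER BY (...) clause."""
    match = re.search(r"CLUSTER BY\s*\((.*?)\)", ddl_fragment, re.IGNORECASE)
    return [c.strip() for c in match.group(1).split(",")] if match else []

# Example: CLUSTER BY (order_ts) on a timestamp column -> day(order_ts)
cols = parse_cluster_by("CLUSTER BY (order_ts)")
print([map_partition_expr(c, "timestamp") for c in cols])  # ['day(order_ts)']
```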
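A trimmed sketch of what a generated load job could look like in snapshot mode, using Spark's DataFrameWriterV2 API for Iceberg. The catalog name lake, the staging path, and the table identifier are placeholders, and the real templates carry more table properties and catalog configuration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import days

spark = (
    SparkSession.builder
    .appName("iceberg-load-orders")
    # Assumes an Iceberg catalog named "lake"; the remaining catalog settings
    # (type, warehouse location) are cluster-specific and omitted here.
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .getOrCreate()
)

df = spark.read.parquet("s3://staging/orders/")  # placeholder staging path

# Snapshot-style write: create or replace the target table with the mapped
# partition transform and table properties taken from the schema manifest.
(
    df.writeTo("lake.sales.orders")
    .using("iceberg")
    .partitionedBy(days("order_ts"))
    .tableProperty("write.format.default", "parquet")
    .createOrReplace()
)
```

An append-mode run against an existing table would end with .append() instead of .createOrReplace().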
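The post-load checks can be sketched as paired aggregations over the source and target tables. This simplified version covers row count, null rate, and min/max for one numeric column; the sample hash comparison and the HTML report generation are omitted, and numeric_profile and reconcile are hypothetical names.

```python
from pyspark.sql import DataFrame, functions as F

def numeric_profile(df: DataFrame, col: str) -> dict:
    """Row count, null rate, and min/max for one numeric column."""
    row = df.agg(
        F.count(F.lit(1)).alias("rows"),
        F.sum(F.when(F.col(col).isNull(), 1).otherwise(0)).alias("nulls"),
        F.min(col).alias("min"),
        F.max(col).alias("max"),
    ).first()
    return {
        "rows": row["rows"],
        "null_rate": row["nulls"] / row["rows"] if row["rows"] else 0.0,
        "min": row["min"],
        "max": row["max"],
    }

def reconcile(source: DataFrame, target: DataFrame, col: str, tol: float = 0.0) -> bool:
    """Pass if the target's profile matches the source's within tolerance."""
    s, t = numeric_profile(source, col), numeric_profile(target, col)
    return (
        s["rows"] == t["rows"]
        and abs(s["null_rate"] - t["null_rate"]) <= tol
        and s["min"] == t["min"]
        and s["max"] == t["max"]
    )
```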
Results
Reduced migration project setup time from 1-2 weeks to 1-2 days across four production migrations. The schema extraction module is now used independently by two other teams for documentation generation.