This project provides tools to generate sample data into a Delta Lake table and stream the resulting changes to ClickHouse using Delta's Change Data Feed (CDF).

Limitations:
- Only INSERTs and UPDATEs are supported (DELETEs are ignored).
- The data generator writes a fixed schema to the Delta table.
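Change Data Feed records row-level changes between Delta commits, tagging each row with `_change_type`, `_commit_version`, and `_commit_timestamp` metadata columns; CDF must be enabled on the table (the `delta.enableChangeDataFeed` table property). As a hedged sketch of the mechanism, not necessarily how `main.py` implements it, the `deltalake` Python package exposes the feed via `DeltaTable.load_cdf()`:

```python
from deltalake import DeltaTable

# Illustrative sketch only: the path is a placeholder, and the CDC script
# in this repo may read the feed differently.
dt = DeltaTable("s3://your-bucket/path/to/table")

# load_cdf() streams row-level changes as a pyarrow RecordBatchReader;
# every change row carries _change_type, _commit_version, _commit_timestamp.
for batch in dt.load_cdf(starting_version=0):
    rows = batch.to_pylist()
    # Mirror this project's behavior: keep inserts and the post-update
    # image of each update; drop deletes (and update preimages).
    kept = [r for r in rows if r["_change_type"] in ("insert", "update_postimage")]
    print(f"{len(kept)} insert/update rows in this batch")
```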
Prerequisites:
- Python 3.8+
- AWS credentials configured with access to S3
- A ClickHouse server (local or cloud)
- The required Python packages (install with `pip install -r requirements.txt`)
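Optionally, you can sanity-check the AWS and ClickHouse prerequisites before running anything. A minimal sketch, assuming `boto3` and `clickhouse-connect` are among the installed packages (host and credentials below are placeholders):

```python
import boto3
import clickhouse_connect

# Confirm AWS credentials resolve (raises NoCredentialsError otherwise).
print(boto3.client("sts").get_caller_identity()["Arn"])

# Confirm the ClickHouse server is reachable.
client = clickhouse_connect.get_client(host="localhost", username="default", password="")
print(client.ping())  # True when the server responds
```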
First, let's generate some sample data into a Delta Lake table in S3:

```bash
python data_generator.py -p s3://your-bucket/path/to/deltalake/table -r us-east-1
```
Options:
- `-p, --bucket_path`: S3 path where the Delta table will be stored (required)
- `-r, --delta_region`: AWS region for the S3 bucket (default: `us-east-1`)
- `-b, --batch-size`: number of rows per batch (default: 10000)
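For reference, appending a batch to a Delta table from Python is typically a single `write_deltalake()` call from the `deltalake` package. The sketch below is illustrative only; the column names mirror the schema used later in this README, and the generator's actual code may differ:

```python
import uuid
from datetime import datetime, timezone

import pandas as pd
from deltalake import write_deltalake

# One batch in the fixed schema (id, name, age, created_at).
batch = pd.DataFrame({
    "id": [str(uuid.uuid4()) for _ in range(10_000)],
    "name": [f"user_{i}" for i in range(10_000)],
    "age": [20 + i % 50 for i in range(10_000)],
    "created_at": [datetime.now(timezone.utc)] * 10_000,
})

# Each append produces a new Delta commit version, which is exactly what
# the Change Data Feed tracks. Depending on your delta-rs version, writing
# to S3 may also need a locking provider or
# "AWS_S3_ALLOW_UNSAFE_RENAME": "true" in storage_options.
write_deltalake(
    "s3://your-bucket/path/to/deltalake/table",
    batch,
    mode="append",
    storage_options={"AWS_REGION": "us-east-1"},
)
```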
You can query the Delta Lake table directly from ClickHouse using the DeltaLake table engine:
```sql
CREATE TABLE my_delta_table
ENGINE = DeltaLake('s3://your-bucket/path/to/table')
```
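Once created, the engine table can be queried like any other ClickHouse table. For example, from Python with the `clickhouse-connect` driver (connection details are placeholders):

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", username="default", password="")

# The DeltaLake engine reads the table's current snapshot straight from S3.
print(client.query("SELECT count() FROM my_delta_table").result_rows[0][0])
```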
Create a table in ClickHouse to store the CDC changes. Its schema should match the Delta table's, plus the three CDF metadata columns:

```sql
CREATE TABLE default.my_cdc_table
(
    `id` String,
    `name` String,
    `age` Int64,
    `created_at` DateTime,
    `_change_type` String,
    `_commit_version` Int64,
    `_commit_timestamp` DateTime
)
ENGINE = ReplacingMergeTree(`_commit_version`)
PARTITION BY toYYYYMM(`created_at`)
ORDER BY (name, age)
SETTINGS index_granularity = 8192;
```
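`ReplacingMergeTree(_commit_version)` deduplicates rows that share the same `ORDER BY` key during background merges, keeping the row with the highest `_commit_version`, so a later UPDATE eventually supersedes the original INSERT. Because merges are asynchronous, add `FINAL` when a query must see fully deduplicated data; for example, via `clickhouse-connect`:

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", username="default", password="")

# FINAL forces query-time deduplication: only the latest version of each
# row (highest _commit_version per sorting key) is returned.
result = client.query("SELECT * FROM default.my_cdc_table FINAL LIMIT 10")
for row in result.result_rows:
    print(row)
```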
Run the CDC script to stream changes from the Delta Lake table to ClickHouse:
```bash
python main.py \
  -p "s3://your-bucket/path/to/table" \
  -r "us-east-2" \
  -t "default.my_cdc_table" \
  -H "host.us-west-2.aws.clickhouse.cloud" \
  -u "default" \
  -P "password" \
  --access-key "[EXAMPLE]" \
  --secret-key "[EXAMPLE]" \
  -v 1
```
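Conceptually, the script's job is a polling loop: read any new CDF entries since the last processed commit, drop deletes (and update preimages), and insert the rest into ClickHouse. A minimal sketch of such a loop, assuming the `deltalake` and `clickhouse-connect` APIs shown earlier; the actual `main.py` may differ, and the connection values are the placeholders from the command above:

```python
import time

import clickhouse_connect
import pyarrow as pa
import pyarrow.compute as pc
from deltalake import DeltaTable

client = clickhouse_connect.get_client(
    host="host.us-west-2.aws.clickhouse.cloud", username="default", password="password"
)

last_version = 1  # a starting commit version (presumably what -v sets)
while True:
    dt = DeltaTable("s3://your-bucket/path/to/table")
    if dt.version() > last_version:
        # Read all changes committed after the last processed version.
        changes = dt.load_cdf(starting_version=last_version + 1).read_all()
        # Keep inserts and post-update images; deletes are ignored.
        mask = pc.is_in(changes["_change_type"],
                        value_set=pa.array(["insert", "update_postimage"]))
        kept = changes.filter(mask)
        if kept.num_rows:
            client.insert_arrow("default.my_cdc_table", kept)
        last_version = dt.version()
    time.sleep(10)  # poll interval
```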