ClickHouse/deltalake-cdc

Delta Lake to ClickHouse CDC Pipeline

This project provides tools to write sample data to a Delta Lake table and to stream that table's changes to ClickHouse using the Delta Lake Change Data Feed (CDF).

Limitations

  • Only INSERTs and UPDATEs are supported; DELETEs are ignored.
  • The data generator writes a fixed schema to the Delta table.
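
The first limitation can be pictured as a filter on the CDF `_change_type` column, whose values in Delta Lake are `insert`, `update_preimage`, `update_postimage`, and `delete`. The sketch below is only an illustration of that idea, not the code in main.py: the helper name `select_supported_changes` is hypothetical, and dropping `update_preimage` rows is an assumption about how UPDATEs are applied.

```python
# Delta Lake CDF tags every change row with a _change_type value.
# Assumption: only inserts and the post-update image are forwarded.
SUPPORTED_CHANGE_TYPES = {"insert", "update_postimage"}


def select_supported_changes(rows):
    """Keep inserts and post-update images; drop deletes and pre-update images."""
    return [row for row in rows if row["_change_type"] in SUPPORTED_CHANGE_TYPES]


# Hypothetical change rows, reduced to the columns needed for the example.
changes = [
    {"id": "a", "_change_type": "insert"},
    {"id": "a", "_change_type": "update_preimage"},
    {"id": "a", "_change_type": "update_postimage"},
    {"id": "b", "_change_type": "delete"},
]

kept = select_supported_changes(changes)
print([row["_change_type"] for row in kept])
```

Under this assumption, a row deleted in the Delta table would remain in ClickHouse, which is what "DELETEs are ignored" implies.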

Prerequisites

  • Python 3.8+
  • AWS credentials configured with access to S3
  • ClickHouse server (local or cloud)
  • Required Python packages (install with pip install -r requirements.txt)

1. Generate Sample Data

First, generate sample data and write it to a Delta Lake table in S3:

python data_generator.py -p s3://your-bucket/path/to/deltalake/table -r us-east-1

Options:

  • -p, --bucket_path: S3 path where the Delta table will be stored (required)
  • -r, --delta_region: AWS region for the S3 bucket (default: us-east-1)
  • -b, --batch-size: Number of rows per batch (default: 10000)
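
For reference, the options above could be declared with `argparse` roughly as follows. This is a hypothetical mirror of the documented flags and defaults, not the actual contents of data_generator.py; the function name `build_parser` and the help strings are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented data_generator.py options."""
    parser = argparse.ArgumentParser(
        description="Write sample data to a Delta Lake table in S3"
    )
    parser.add_argument(
        "-p", "--bucket_path", required=True,
        help="S3 path where the Delta table will be stored",
    )
    parser.add_argument(
        "-r", "--delta_region", default="us-east-1",
        help="AWS region for the S3 bucket",
    )
    parser.add_argument(
        "-b", "--batch-size", type=int, default=10000,
        help="Number of rows per batch",
    )
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(["-p", "s3://your-bucket/path/to/deltalake/table"])
print(args.bucket_path, args.delta_region, args.batch_size)
```

Only `-p` is required; omitting `-r` and `-b` falls back to the defaults listed above.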

2. Query Delta Lake from ClickHouse

You can query the Delta Lake table directly from ClickHouse using the DeltaLake table engine:

CREATE TABLE my_delta_table
    ENGINE = DeltaLake('s3://your-bucket/path/to/table');

3. Create Destination Table in ClickHouse

Create a table in ClickHouse to store the CDC changes. The schema should match your Delta table, plus the three CDF metadata columns (_change_type, _commit_version, _commit_timestamp):

CREATE TABLE default.my_cdc_table
(
    `id` String,
    `name` String,
    `age` Int64,
    `created_at` DateTime,
    `_change_type` String,
    `_commit_version` Int64,
    `_commit_timestamp` DateTime
)
ENGINE = ReplacingMergeTree(`_commit_version`)
PARTITION BY toYYYYMM(`created_at`)
ORDER BY (name, age)
SETTINGS index_granularity = 8192;
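
To make the effect of `ReplacingMergeTree(_commit_version)` concrete, the following sketch models what a merge does: among rows that share the same `ORDER BY` key, the row with the highest version survives. It is a simplified in-memory model, not ClickHouse code. The function name `replacing_merge` and the sample rows are hypothetical, and the model ignores partitioning (ClickHouse only deduplicates rows within the same partition).

```python
def replacing_merge(rows, key_columns=("name", "age"), version_column="_commit_version"):
    """Keep, per sorting key, the row with the highest version (last one wins on ties)."""
    latest = {}
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        current = latest.get(key)
        if current is None or row[version_column] >= current[version_column]:
            latest[key] = row
    return list(latest.values())


# Hypothetical change rows, reduced to the columns that matter for the merge.
rows = [
    {"id": "1", "name": "alice", "age": 30, "_commit_version": 1},
    {"id": "1", "name": "alice", "age": 30, "_commit_version": 3},
    {"id": "2", "name": "bob", "age": 41, "_commit_version": 2},
    {"id": "1", "name": "alice", "age": 31, "_commit_version": 4},
]

merged = replacing_merge(rows)
for row in merged:
    print(row)
```

Two points follow from this model. Merges in ClickHouse run in the background at an unspecified time, so a query that needs deduplicated results should use `SELECT ... FINAL`. And because the key is `(name, age)`, a change to either column produces a row under a new key rather than replacing the old one, as the last sample row shows.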

4. Run the CDC Script

Run the CDC script to stream changes from the Delta Lake table to ClickHouse:

python main.py \
    -p "s3://your-bucket/path/to/table" \
    -r "us-east-2" \
    -t "default.my_cdc_table" \
    -H "host.us-west-2.aws.clickhouse.cloud" \
    -u "default" \
    -P "password" \
    --access-key "[EXAMPLE]" \
    --secret-key "[EXAMPLE]" \
    -v 1
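
CDF reads are addressed by commit version, so an incremental consumer typically remembers the last version it has applied and requests the following one on the next poll. The sketch below shows that bookkeeping in isolation. It is an assumption about how such a loop could be organised, not a description of main.py, and the helper name `next_starting_version` is hypothetical.

```python
def next_starting_version(rows, current_start):
    """Return the commit version to read from on the next poll.

    If the batch is empty, nothing new was committed, so the starting
    version stays where it is.
    """
    versions = [row["_commit_version"] for row in rows]
    return max(versions) + 1 if versions else current_start


# Hypothetical batch of change rows read from version 1 onwards.
batch = [
    {"id": "a", "_commit_version": 1},
    {"id": "b", "_commit_version": 3},
]

start = next_starting_version(batch, current_start=1)
print(start)

# An empty poll leaves the checkpoint unchanged.
print(next_starting_version([], current_start=start))
```

Persisting this value between runs would let the consumer resume without re-reading versions it has already written to ClickHouse.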
