Skip to content

Add filter pushdown optimizations #300

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 42 commits into from
Mar 24, 2025
Merged

Add filter pushdown optimizations #300

merged 42 commits into from
Mar 24, 2025

Conversation

knassre-bodo
Copy link
Contributor

@knassre-bodo knassre-bodo commented Mar 17, 2025

Resolves #302. Set up a relational optimization pipeline after relational conversion, and add an optimization to push filters down as far as possible. Also allows conditions to be pushed into the RHS of a left join (turning it into an inner join) if the RHS being null implies that the filter would output False. Additionally, converted many of the e2e tests (tpch queries, correlated queries) to also test the generated SQL.

@@ -1,18 +1,16 @@
ROOT(columns=[('n', n)], orderings=[])
PROJECT(columns={'n': agg_0})
AGGREGATE(keys={}, aggregations={'agg_0': COUNT()})
FILTER(condition=True:bool, columns={'account_balance': account_balance})
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice side effect: these True filters also get deleted.

Comment on lines -14 to +15
FILTER(condition=NOT(STARTSWITH(name, 'A':string)), columns={'key': key, 'region_name': region_name})
PROJECT(columns={'key': key, 'name': name, 'region_name': name})
PROJECT(columns={'key': key, 'region_name': name})
FILTER(condition=NOT(STARTSWITH(name, 'A':string)), columns={'key': key, 'name': name})
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simple example: filter got pushed before a project

Comment on lines +31 to 32
FILTER(condition=size == 15:int64 & ENDSWITH(part_type, 'BRASS':string), columns={'key': key, 'manufacturer': manufacturer})
SCAN(table=tpch.PART, columns={'key': p_partkey, 'manufacturer': p_mfgr, 'part_type': p_type, 'size': p_size})
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good example: these conditions about the size & part type got pushed all the way to the scan

Comment on lines -11 to -15
FILTER(condition=YEAR(order_date) == 1992:int64, columns={'customer_key': customer_key, 'part_type': part_type, 'supp_region': supp_region})
JOIN(conditions=[t0.order_key == t1.key], types=['inner'], columns={'customer_key': t1.customer_key, 'order_date': t1.order_date, 'part_type': t0.part_type, 'supp_region': t0.supp_region})
PROJECT(columns={'order_key': order_key, 'part_type': part_type, 'supp_region': name_7})
JOIN(conditions=[t0.supplier_key == t1.key], types=['left'], columns={'name_7': t1.name_7, 'order_key': t0.order_key, 'part_type': t0.part_type})
FILTER(condition=MONTH(ship_date) == 6:int64 & YEAR(ship_date) == 1992:int64, columns={'order_key': order_key, 'part_type': part_type, 'supplier_key': supplier_key})
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good examples: these year/month filters got pushed way down to the scans to order/lineitem.

Comment on lines +12 to +14
JOIN(conditions=[t0.region_key == t1.key], types=['inner'], columns={'key': t0.key})
SCAN(table=tpch.NATION, columns={'key': n_nationkey, 'region_key': n_regionkey})
FILTER(condition=name == 'EUROPE':string, columns={'key': key})
Copy link
Contributor Author

@knassre-bodo knassre-bodo Mar 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an interesting case. In the original version, the filter name_3 == EUROPE happened after a left-join, and name_3 returned to a column from the RHS of the left-join. In the new version, we've pushed this filter into the RHS of the join, which has become an inner join, because any row that would have been a null match due to the left join would get deleted by the filter anyway. Therefore, the left join + filter is the same as filtering the input & doing an inner join.

@@ -2,8 +2,8 @@ ROOT(columns=[('name', name), ('okey', okey)], orderings=[])
PROJECT(columns={'name': name, 'okey': key_2})
JOIN(conditions=[t0.key == t1.nation_key], types=['left'], columns={'key_2': t1.key_2, 'name': t0.name})
SCAN(table=tpch.NATION, columns={'key': n_nationkey, 'name': n_name})
FILTER(condition=key_2 == 454791:int64, columns={'key_2': key_2, 'nation_key': nation_key})
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another simple example: previously this filter happened after the join, and now it got pushed into the RHS input.

JOIN(conditions=[t0.part_key == t1.key], types=['inner'], columns={'brand': t1.brand, 'container': t1.container, 'discount': t0.discount, 'extended_price': t0.extended_price, 'quantity': t0.quantity, 'size': t1.size})
FILTER(condition=ship_instruct == 'DELIVER IN PERSON':string & ISIN(ship_mode, ['AIR', 'AIR REG']:array[unknown]), columns={'discount': discount, 'extended_price': extended_price, 'part_key': part_key, 'quantity': quantity})
SCAN(table=tpch.LINEITEM, columns={'discount': l_discount, 'extended_price': l_extendedprice, 'part_key': l_partkey, 'quantity': l_quantity, 'ship_instruct': l_shipinstruct, 'ship_mode': l_shipmode})
FILTER(condition=size >= 1:int64, columns={'brand': brand, 'container': container, 'key': key, 'size': size})
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Part of the conjunction got pushed down

@@ -6,14 +6,14 @@ ROOT(columns=[('NATION', NATION), ('O_YEAR', O_YEAR), ('AMOUNT', AMOUNT)], order
PROJECT(columns={'nation_name': nation_name, 'o_year': YEAR(order_date), 'value': extended_price * 1:int64 - discount - supplycost * quantity})
JOIN(conditions=[t0.order_key == t1.key], types=['left'], columns={'discount': t0.discount, 'extended_price': t0.extended_price, 'nation_name': t0.nation_name, 'order_date': t1.order_date, 'quantity': t0.quantity, 'supplycost': t0.supplycost})
JOIN(conditions=[t0.part_key == t1.part_key & t0.supplier_key == t1.supplier_key], types=['inner'], columns={'discount': t1.discount, 'extended_price': t1.extended_price, 'nation_name': t0.nation_name, 'order_key': t1.order_key, 'quantity': t1.quantity, 'supplycost': t0.supplycost})
FILTER(condition=CONTAINS(name_7, 'green':string), columns={'nation_name': nation_name, 'part_key': part_key, 'supplier_key': supplier_key, 'supplycost': supplycost})
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This condition is also pushable beneath a left join (turning it to inner) since if name_7 is null, the CONTAINS condition will be False.

Base automatically changed from kian/decorell_opt to main March 19, 2025 15:28
Comment on lines 478 to 479
if aggregations:
query = query.group_by(*keys)
query = query.group_by(*sorted(keys))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixes a nondeterminism issue in the generated SQL

),
],
)
def test_tpch_relational_to_sqlite_sql(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant with test_pipeline_until_sql_tpch

@knassre-bodo knassre-bodo marked this pull request as ready for review March 19, 2025 16:45
Copy link
Contributor

@hadia206 hadia206 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.
Left some questions and minor suggestions

pushable_filters, remaining_filters = set(), filters
else:
# Otherwise push all filters that only depend on on columns in
# the project that are pass-through of another column.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adding a simple comment example will be useful

original_name: str
if columns is None:
for original_name in node.calc_terms:
name = renamings.get(original_name, original_name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that meant to be passing same variable twice?

Copy link
Contributor Author

@knassre-bodo knassre-bodo Mar 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, its saying that if original_name is not a key in renamings, then just use original_name instead of doing the dictionary lookup.

Comment on lines +53 to +56
"""
A set of operators with the property that the output is null if any of the
inputs are null.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be above the list?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the reasoning is so that when you hover on the variable in VSCode the details come up on the tip tool .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the convention is for it to be below.

Comment on lines +269 to +270
if new_column.input_name is not None:
new_column = new_column.with_input(None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my understanding, why remove input name from the column?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the input name is used by join to say which input the column comes from, but now we are pushing into the input itself so that input index is no longer valid.

@@ -375,6 +375,38 @@ def defog_test_data(
return request.param


def test_defog_until_sql(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name of the test is confusing. Doesn't match the docstring description

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is that it tests each defog query by running it through the pipeline up until SQL conversion. The ones that go up to relational plans are called _until_relational, and the ones that go all the way to runtime execution are called e2e.

Copy link
Contributor

@vineetg3 vineetg3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@knassre-bodo knassre-bodo merged commit 6e77681 into main Mar 24, 2025
5 checks passed
@knassre-bodo knassre-bodo deleted the kian/filter_opt branch March 24, 2025 05:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enable filter pushdown optimizations
3 participants