Add filter pushdown optimizations #300

knassre-bodo · 2025-03-17T19:54:38Z

Resolves #302. Sets up a relational optimization pipeline that runs after relational conversion, and adds an optimization that pushes filters down as far as possible. A condition can also be pushed into the RHS of a left join (turning it into an inner join) if the RHS being null implies that the filter would evaluate to False. Additionally, converts many of the e2e tests (tpch queries, correlated queries) to also test the generated SQL.

knassre-bodo · 2025-03-18T15:33:30Z

tests/test_plan_refsols/correl_14.txt

@@ -1,18 +1,16 @@
 ROOT(columns=[('n', n)], orderings=[])
 PROJECT(columns={'n': agg_0})
  AGGREGATE(keys={}, aggregations={'agg_0': COUNT()})
-   FILTER(condition=True:bool, columns={'account_balance': account_balance})


Nice side effect: these True filters also get deleted.

knassre-bodo · 2025-03-18T15:34:19Z

tests/test_plan_refsols/correl_2.txt

-         FILTER(condition=NOT(STARTSWITH(name, 'A':string)), columns={'key': key, 'region_name': region_name})
-          PROJECT(columns={'key': key, 'name': name, 'region_name': name})
+         PROJECT(columns={'key': key, 'region_name': name})
+          FILTER(condition=NOT(STARTSWITH(name, 'A':string)), columns={'key': key, 'name': name})


Simple example: the filter got pushed beneath a project.

knassre-bodo · 2025-03-18T15:35:05Z

tests/test_plan_refsols/tpch_q2.txt

+       FILTER(condition=size == 15:int64 & ENDSWITH(part_type, 'BRASS':string), columns={'key': key, 'manufacturer': manufacturer})
        SCAN(table=tpch.PART, columns={'key': p_partkey, 'manufacturer': p_mfgr, 'part_type': p_type, 'size': p_size})


Good example: these conditions on the size and part type got pushed all the way down to the scan.

knassre-bodo · 2025-03-18T15:40:04Z

tests/test_plan_refsols/triple_partition.txt

-          FILTER(condition=YEAR(order_date) == 1992:int64, columns={'customer_key': customer_key, 'part_type': part_type, 'supp_region': supp_region})
-           JOIN(conditions=[t0.order_key == t1.key], types=['inner'], columns={'customer_key': t1.customer_key, 'order_date': t1.order_date, 'part_type': t0.part_type, 'supp_region': t0.supp_region})
-            PROJECT(columns={'order_key': order_key, 'part_type': part_type, 'supp_region': name_7})
-             JOIN(conditions=[t0.supplier_key == t1.key], types=['left'], columns={'name_7': t1.name_7, 'order_key': t0.order_key, 'part_type': t0.part_type})
-              FILTER(condition=MONTH(ship_date) == 6:int64 & YEAR(ship_date) == 1992:int64, columns={'order_key': order_key, 'part_type': part_type, 'supplier_key': supplier_key})


Good examples: these year/month filters got pushed all the way down to the scans of the order and lineitem tables.

knassre-bodo · 2025-03-18T16:00:12Z

tests/test_plan_refsols/tpch_q2.txt

+           JOIN(conditions=[t0.region_key == t1.key], types=['inner'], columns={'key': t0.key})
+            SCAN(table=tpch.NATION, columns={'key': n_nationkey, 'region_key': n_regionkey})
+            FILTER(condition=name == 'EUROPE':string, columns={'key': key})


This is an interesting case. In the original version, the filter name_3 == EUROPE happened after a left join, and name_3 referred to a column from the RHS of that left join. In the new version, we've pushed this filter into the RHS of the join, which has become an inner join, because any row that would have been a null match due to the left join would get deleted by the filter anyway. Therefore, the left join + filter is the same as filtering the RHS input and doing an inner join.
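The equivalence described above can be checked on a toy dataset. This is a minimal sketch using Python's built-in sqlite3 module; the table contents are made up for illustration and the SQL is hand-written, not the SQL that PyDough generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE nation (n_nationkey INTEGER, n_regionkey INTEGER);
    CREATE TABLE region (r_regionkey INTEGER, r_name TEXT);
    -- Nation 3 points at a region that does not exist, so a left join
    -- produces a NULL r_name for it.
    INSERT INTO nation VALUES (1, 10), (2, 20), (3, 99);
    INSERT INTO region VALUES (10, 'EUROPE'), (20, 'ASIA');
    """
)

# Original shape: left join first, then filter on a column from the RHS.
left_then_filter = conn.execute(
    """
    SELECT n.n_nationkey
    FROM nation AS n
    LEFT JOIN region AS r ON n.n_regionkey = r.r_regionkey
    WHERE r.r_name = 'EUROPE'
    ORDER BY 1
    """
).fetchall()

# Optimized shape: filter the RHS input first, then do an inner join.
filter_then_inner = conn.execute(
    """
    SELECT n.n_nationkey
    FROM nation AS n
    INNER JOIN (
        SELECT r_regionkey FROM region WHERE r_name = 'EUROPE'
    ) AS r ON n.n_regionkey = r.r_regionkey
    ORDER BY 1
    """
).fetchall()

print(left_then_filter, filter_then_inner)
```

Both queries return the same rows because the unmatched nation has a NULL r_name, and NULL = 'EUROPE' is never true.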

knassre-bodo · 2025-03-18T16:02:22Z

tests/test_plan_refsols/singular2.txt

@@ -2,8 +2,8 @@ ROOT(columns=[('name', name), ('okey', okey)], orderings=[])
 PROJECT(columns={'name': name, 'okey': key_2})
  JOIN(conditions=[t0.key == t1.nation_key], types=['left'], columns={'key_2': t1.key_2, 'name': t0.name})
   SCAN(table=tpch.NATION, columns={'key': n_nationkey, 'name': n_name})
-   FILTER(condition=key_2 == 454791:int64, columns={'key_2': key_2, 'nation_key': nation_key})


This is another simple example: previously this filter happened after the join, and now it got pushed into the RHS input.

knassre-bodo · 2025-03-18T16:21:56Z

tests/test_plan_refsols/tpch_q19.txt

+    JOIN(conditions=[t0.part_key == t1.key], types=['inner'], columns={'brand': t1.brand, 'container': t1.container, 'discount': t0.discount, 'extended_price': t0.extended_price, 'quantity': t0.quantity, 'size': t1.size})
+     FILTER(condition=ship_instruct == 'DELIVER IN PERSON':string & ISIN(ship_mode, ['AIR', 'AIR REG']:array[unknown]), columns={'discount': discount, 'extended_price': extended_price, 'part_key': part_key, 'quantity': quantity})
+      SCAN(table=tpch.LINEITEM, columns={'discount': l_discount, 'extended_price': l_extendedprice, 'part_key': l_partkey, 'quantity': l_quantity, 'ship_instruct': l_shipinstruct, 'ship_mode': l_shipmode})
+     FILTER(condition=size >= 1:int64, columns={'brand': brand, 'container': container, 'key': key, 'size': size})


Part of the conjunction got pushed down into the join inputs.
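One way to picture the partial pushdown: split the conjunction into its individual conjuncts and push only those whose columns all come from a single join input. The sketch below is an illustration under that assumption; `Conjunct` and `split_conjunction` are hypothetical names, not the classes or helpers used in filter_pushdown.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conjunct:
    """Hypothetical stand-in for one AND-ed condition of a filter."""

    text: str
    columns: frozenset


def split_conjunction(conjuncts, input_columns):
    """Separate conjuncts that only use columns of one input from the rest."""
    pushable, remaining = [], []
    for conjunct in conjuncts:
        if conjunct.columns <= input_columns:
            pushable.append(conjunct)
        else:
            remaining.append(conjunct)
    return pushable, remaining


# Columns produced by the LINEITEM scan in the tpch_q19 plan above.
lineitem_columns = frozenset(
    {"discount", "extended_price", "part_key", "quantity", "ship_instruct", "ship_mode"}
)
conjuncts = [
    Conjunct("ship_instruct == 'DELIVER IN PERSON'", frozenset({"ship_instruct"})),
    Conjunct("ISIN(ship_mode, ['AIR', 'AIR REG'])", frozenset({"ship_mode"})),
    Conjunct("size >= 1", frozenset({"size"})),
]
pushable, remaining = split_conjunction(conjuncts, lineitem_columns)
print([c.text for c in pushable])
print([c.text for c in remaining])
```

Here the two shipping conditions can go into the LINEITEM input, while `size >= 1` has to be handled elsewhere (in the plan above it ends up on the other join input).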

knassre-bodo · 2025-03-18T16:22:29Z

tests/test_plan_refsols/tpch_q9.txt

@@ -6,14 +6,14 @@ ROOT(columns=[('NATION', NATION), ('O_YEAR', O_YEAR), ('AMOUNT', AMOUNT)], order
     PROJECT(columns={'nation_name': nation_name, 'o_year': YEAR(order_date), 'value': extended_price * 1:int64 - discount - supplycost * quantity})
      JOIN(conditions=[t0.order_key == t1.key], types=['left'], columns={'discount': t0.discount, 'extended_price': t0.extended_price, 'nation_name': t0.nation_name, 'order_date': t1.order_date, 'quantity': t0.quantity, 'supplycost': t0.supplycost})
       JOIN(conditions=[t0.part_key == t1.part_key & t0.supplier_key == t1.supplier_key], types=['inner'], columns={'discount': t1.discount, 'extended_price': t1.extended_price, 'nation_name': t0.nation_name, 'order_key': t1.order_key, 'quantity': t1.quantity, 'supplycost': t0.supplycost})
-        FILTER(condition=CONTAINS(name_7, 'green':string), columns={'nation_name': nation_name, 'part_key': part_key, 'supplier_key': supplier_key, 'supplycost': supplycost})


This condition is also pushable beneath a left join (turning it into an inner join), since if name_7 is null, the CONTAINS condition will be False.
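A conservative way to decide whether a condition is rejected when the RHS columns are null is to walk the expression and check that nulls propagate all the way to the top. The sketch below assumes a tiny tuple-based expression format and a made-up operator set; neither is PyDough's actual representation, and `DEFAULT_TO` is used only as an example of an operator that is deliberately left out of the set.

```python
# Hypothetical set of operators whose output is null if any input is null.
NULL_PROPAGATING_OPS = {"CONTAINS", "STARTSWITH", "ENDSWITH", "EQU"}


def null_if_columns_null(expr, null_columns):
    """Return True only if expr must be null whenever null_columns are null.

    Expressions are tuples: ("col", name), ("lit", value),
    or ("call", operator, [arguments]).
    """
    kind = expr[0]
    if kind == "col":
        return expr[1] in null_columns
    if kind == "lit":
        return False
    if kind == "call":
        _, operator, args = expr
        # Unknown operators are treated as not null-propagating (conservative).
        return operator in NULL_PROPAGATING_OPS and any(
            null_if_columns_null(arg, null_columns) for arg in args
        )
    raise ValueError(f"Unknown expression kind: {kind}")


rhs_columns = {"name_7"}
contains_cond = ("call", "CONTAINS", [("col", "name_7"), ("lit", "green")])
defaulted_cond = ("call", "DEFAULT_TO", [("col", "name_7"), ("lit", "")])

print(null_if_columns_null(contains_cond, rhs_columns))
print(null_if_columns_null(defaulted_cond, rhs_columns))
```

A filter whose condition evaluates to null drops the row just as a False condition would, which is why a null result is enough to justify turning the left join into an inner join.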

knassre-bodo · 2025-03-19T15:43:11Z

pydough/sqlglot/sqlglot_relational_visitor.py

            if aggregations:
-                query = query.group_by(*keys)
+                query = query.group_by(*sorted(keys))


This fixes a nondeterminism issue in the generated SQL: the order of the GROUP BY keys no longer depends on iteration order.
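As a small illustration of why sorting helps, assume the keys are plain strings held in an unordered collection (an assumption for this sketch; the actual key type in the visitor may differ). Iterating a set of strings can give a different order from one interpreter run to the next, whereas sorting gives one fixed order.

```python
# Hypothetical grouping keys held in a set, whose iteration order is not
# guaranteed to be the same across runs.
keys = {"o_year", "nation_name", "supp_region"}

# Sorting before rendering yields the same clause every time.
group_by_clause = "GROUP BY " + ", ".join(sorted(keys))
print(group_by_clause)
```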

knassre-bodo · 2025-03-19T15:43:53Z

tests/test_relational_to_sql.py

-        ),
-    ],
-)
-def test_tpch_relational_to_sqlite_sql(


Redundant with test_pipeline_until_sql_tpch

pydough/conversion/filter_pushdown.py

hadia206

Looks good to me.
I left some questions and minor suggestions.

hadia206 · 2025-03-21T17:53:41Z

pydough/conversion/filter_pushdown.py

+                pushable_filters, remaining_filters = set(), filters
+            else:
+                # Otherwise push all filters that only depend on columns in
+                # the project that are pass-through of another column.


Adding a simple example to this comment would be useful.
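For instance, the pass-through rule could look roughly like the sketch below. It assumes a simplified project representation where each output column is either a bare column reference or some computed expression; `split_filters_for_project` and the tuple formats are hypothetical, not the code in this PR.

```python
def split_filters_for_project(filters, project_columns):
    """Split filters into those that can move beneath a project and the rest.

    project_columns maps output name -> ("col", input_name) for pass-through
    columns, or ("expr", text) for computed ones. Each filter is a pair of
    (condition text, set of output columns it uses).
    """
    passthrough = {
        out: expr[1] for out, expr in project_columns.items() if expr[0] == "col"
    }
    pushable, remaining = [], []
    for text, used in filters:
        if all(column in passthrough for column in used):
            # Record which input columns the filter uses once it is pushed.
            pushable.append((text, {passthrough[column] for column in used}))
        else:
            remaining.append((text, used))
    return pushable, remaining


project_columns = {
    "key": ("col", "key"),
    "region_name": ("col", "name"),    # pass-through rename of `name`
    "name_upper": ("expr", "UPPER(name)"),  # computed, not pass-through
}
filters = [
    ("NOT(STARTSWITH(region_name, 'A'))", {"region_name"}),
    ("name_upper == 'ASIA'", {"name_upper"}),
]
pushable, remaining = split_filters_for_project(filters, project_columns)
print(pushable)
print(remaining)
```

The first filter only uses `region_name`, which is a renamed pass-through of `name`, so it can be evaluated beneath the project in terms of `name`. The second depends on a computed column and stays above.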

hadia206 · 2025-03-21T18:00:24Z

pydough/conversion/relational_converter.py

+    original_name: str
+    if columns is None:
+        for original_name in node.calc_terms:
+            name = renamings.get(original_name, original_name)


Is this meant to pass the same variable twice?

Yes, it's saying that if original_name is not a key in renamings, then original_name itself is used as the default value instead of a renamed one.
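A quick standalone illustration of that `dict.get` pattern, with made-up values for `renamings` and the term names:

```python
# Hypothetical renamings: only `name` has been renamed.
renamings = {"name": "nation_name"}
calc_terms = ["key", "name"]

# The second argument to get() is the fallback used when the key is absent,
# so terms without a renaming keep their original name.
names = [renamings.get(term, term) for term in calc_terms]
print(names)
```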

hadia206 · 2025-03-21T18:02:42Z

pydough/relational/rel_util.py

+"""
+A set of operators with the property that the output is null if any of the
+inputs are null.
+"""


Should this be above the list?

I think the reasoning is that when you hover over the variable in VS Code, the details show up in the tooltip.

Yeah, the convention is for it to be below.

hadia206 · 2025-03-21T18:20:15Z

pydough/relational/rel_util.py

+            if new_column.input_name is not None:
+                new_column = new_column.with_input(None)


For my understanding, why remove input name from the column?

Because the input name is used by the join to say which input the column comes from, but now we are pushing the condition into the input itself, so that input name is no longer valid.
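A rough sketch of the idea, using a hypothetical `ColumnRef` dataclass as a stand-in for the real column reference class (only the `input_name` attribute and `with_input` method names are taken from the diff above):

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ColumnRef:
    """Hypothetical stand-in for a column reference in a relational plan."""

    name: str
    input_name: Optional[str] = None

    def with_input(self, input_name: Optional[str]) -> "ColumnRef":
        # Return a copy that points at a different input (or at none).
        return replace(self, input_name=input_name)


def strip_input(column: ColumnRef) -> ColumnRef:
    """Drop the join-input qualifier when moving a reference into an input."""
    if column.input_name is not None:
        column = column.with_input(None)
    return column


# Above the join, the column is qualified by the input it comes from.
above_join = ColumnRef("name_7", input_name="t1")
# Inside the RHS input there is no t0/t1 distinction any more.
inside_rhs = strip_input(above_join)
print(inside_rhs)
```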

hadia206 · 2025-03-21T18:23:09Z

tests/test_pipeline_defog.py

@@ -375,6 +375,38 @@ def defog_test_data(
    return request.param


+def test_defog_until_sql(


The name of the test is confusing; it doesn't match the docstring description.

The idea is that it tests each defog query by running it through the pipeline up until SQL conversion. The ones that go up to relational plans are called _until_relational, and the ones that go all the way to runtime execution are called e2e.

vineetg3

LGTM!

knassre-bodo added 23 commits March 12, 2025 13:42

Rewriting tests to further encourage the desired behavior

9961d98

WIP

a042050

Achieved decorrelation optimization for singular correl queries 9/17

a4bec0e

Adding support for the aggregation cases (correl 6/18/19/20, some of …

531c2b0

…the prev/next tests)

[RUN CI]

9598df6

Added ANYTHING function to avoid calling MIN when possible

eac96ad

Revisions, cleanup, and comments/docstrings

8bec6a8

Added two additional correlation stress-tests for the new behavior [RU…

f24ea90

…N CI]

Fixing correl_24 refsol [RUN CI]

0a48157

Adjusted q5, added more correl queries similar to q5, fixed behavior …

153447c

…after pullup transformation to remove now-defunct children and renumber the remainder [RUN CI]

Adding TOC entry for ANYTHING function

b97f8bc

Added more extreme edge case handling with PullUp operator, child rem…

46f8bfb

…oval, and name collisions, including multiple pullups happening together [RUN CI]

Removing dead code [RUN CI]

664dd45

Removing dead code [RUN CI]

6a26a80

Resolving conflicts with singular PR

de17970

Add redundant left-join pruning [RUN CI]

329551c

Refactoring correl test 29 [RUN CI]

db2c508

Adjusting correl_29 again to avoid sqlite parser stack overflow [RUN CI]

e83082c

Setting up optimizer workflow

f894353

Started filter pushdown structure

2cdf0d6

Initial implementation of filter pushdown complete

35afd55

Fixing window handling

0e4102c

[RUN CI]

5ed4d05

knassre-bodo commented Mar 18, 2025

[RUN CI]

1616aff

knassre-bodo commented Mar 18, 2025

knassre-bodo added 4 commits March 18, 2025 12:24

[RUN CI]

02430ed

Added SQL refsols for tpch queries

d8f9509

Converting TPCH and Correl tests to SQL tests

d6fd261

Pushing logic into helper file

5c3b744

Base automatically changed from kian/decorell_opt to main March 19, 2025 15:28

knassre-bodo added 2 commits March 19, 2025 11:30

Resolving conflicts

56cb8f4

Fixing nondeterminism issues

044e8fa

knassre-bodo commented Mar 19, 2025

knassre-bodo added 2 commits March 19, 2025 12:13

Fixing bugs, updating tests, adding docstrings and readmes [RUN CI]

8436477

Fixing failing test [RUN CI]

c9b42dd

knassre-bodo requested review from vineetg3 and hadia206 March 19, 2025 16:45

knassre-bodo marked this pull request as ready for review March 19, 2025 16:45

knassre-bodo added 2 commits March 20, 2025 14:13

Removing SQL tests for correl files, adding them for defog queries

fc4cf07

[RUN CI]

bf47f87

vineetg3 reviewed Mar 20, 2025

pydough/conversion/filter_pushdown.py (resolved)

hadia206 approved these changes Mar 21, 2025

vineetg3 approved these changes Mar 21, 2025

knassre-bodo and others added 7 commits March 21, 2025 17:22

Update pydough/conversion/filter_pushdown.py

3ec99b7

Co-authored-by: Hadia Ahmed <[email protected]>

Update pydough/relational/rel_util.py

7ef4f65

Co-authored-by: Hadia Ahmed <[email protected]>

Merge branch 'main' into kian/filter_opt

c5f0025

Final revisions [RUN CI]

f3d4568

Merge remote-tracking branch 'origin/kian/filter_opt' into kian/filte…

43bf268

…r_opt

Final revisions [RUN CI]

4f47294

Updating plans [RUN CI]

b896d6e

knassre-bodo merged commit 6e77681 into main Mar 24, 2025
5 checks passed

knassre-bodo deleted the kian/filter_opt branch March 24, 2025 05:02
