Skip to content

Add relational expression simplification with predicate inference #396

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 79 commits into from
Aug 1, 2025

Conversation

knassre-bodo
Copy link
Contributor

@knassre-bodo knassre-bodo commented Jul 19, 2025

Adds a simplification pass in the relational optimizer that simplifies relational expressions using predicate inference & constant folding. Built via a relational visitor which modifies the relational nodes in place by invoking a relational shuttle that transforms the expressions, both of which propagate information via PredicateSet objects which contain information about whether an expression matches certain criteria (positive, non-negative, non-null).

Also updates the testing infrastructure in several ways:

  • PyDoughPandasTest can now take in a string instead of a function (evaluated via pydough.from_string).
  • The args argument to PyDoughPandasTest is replaced with kwargs, passed in as keyword arguments to functions or as an environment to pydough.from_string.

@knassre-bodo knassre-bodo marked this pull request as ready for review July 24, 2025 05:45
@@ -1531,6 +1540,9 @@ def convert_ast_to_relational(
raw_result: RelationalRoot = postprocess_root(node, columns, hybrid, output)

# Invoke the optimization procedures on the result to clean up the tree.
optimized_result: RelationalRoot = optimize_relational_tree(raw_result, configs)
additional_shuttles: list[RelationalExpressionShuttle] = []
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This list will be used down the lines for when additional configurations are set that allow more specific/advanced transformations to happen during simplification.

" s03 = DEFAULT_TO(None, 0) > 0," # -> False
" s04 = DEFAULT_TO(None, 0) <= 0," # -> True
" s05 = DEFAULT_TO(None, 0) < 0," # -> False
" s06 = DEFAULT_TO(None, 0) == None," # -> None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For tests from s06 to s11 shouldn't we expect True/False values ? Why are they None?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

x == NULL is always NULL, the same goes for other operators. You can test it in SQL. Some dialects have "null safe" variants (for example, in Snowflake you can do EQUAL_NULL(1, NULL) which is False, or EQUAL_NULL(NULL, NULL) which is True)

" s03 = MONOTONIC(1, 4, 3)," # -> False
" s04 = MONOTONIC(1, 2, 1)," # -> False
" s05 = MONOTONIC(1, 0, 1)," # -> False
" s06 = MONOTONIC(1, LENGTH('foo'), COUNT(customers))," # -> 3 <= COUNT(customers)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the comment of s06 and s08 comments accurate? Taking in count that below we are expecting 1 and '0' respectively

Copy link
Contributor Author

@knassre-bodo knassre-bodo Jul 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comments indicate what they get simplified to, not what they evaluate to. The simplified expression still needs to get evaluated. In this case, s06 simplifies to MONOTONIC(1, 3, COUNT(customers)), which is equivalent to (1 <= 3) & (3 <= COUNT(customers), which simplifies to True & (3 <= COUNT(customers)), which just becomes 3 <= COUNT(customers).

Also there is no True/False in sqlite, it just returns 0/1.

Copy link
Contributor

@hadia206 hadia206 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall code seems to do what the PR description is saying.
Please see my comments below, I'm mainly looking for some more tests.

if len(predicates) == 0:
return result
else:
result |= predicates[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be result = predicates[0]?
My understanding is that result |= predicates[0] will be a no-op since result is the default PredicateSet(), and | will be a union

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I don't want the intersect operator to actually mutate any of the input sets (since everything is mutable). If I do result |= predicates[0] then in the next section do result &= pred, this will mutate both result and predicates[0], which is a problem if predicates[0] is being used elsewhere.

"""
result: PredicateSet = PredicateSet()
if len(predicates) == 0:
return result
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that man both union and intersection with empty set will have the same result value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is saying if we call it on an empty list of sets, the output assumption is that we know nothing.

output_predicates: PredicateSet = PredicateSet()
if literal_expression.value is not None:
output_predicates.not_null = True
if isinstance(literal_expression.value, (int, float)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about booleans? 0/1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point

if literal_expression.value >= 0:
output_predicates.not_negative = True
if literal_expression.value > 0:
output_predicates.positive = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have a default or fallback for other types? (string, boolean, dates, ...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We start with the default predicate set (we know nothing to be true) and if the value happens to be in these cases then we add predicates to it. The default is we don't enter these cases so we just return the empty predicate set.

"simplification_3",
),
id="simplification_3",
),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other tests to include

  • Tests for ADD, MUL, DIV simplification outside ABS and DEFAULT_TO
  • Test multiplication by negative or zero explicitly to confirm positive/not-negative predicates.
  • Tests for COUNT(*), COUNT(non-null column), and their positive/non-null predicates outside ABS and DEFAULT_TO
  • Logical operators with mixed literal and non-literal inputs
  • XOR (BXR) and LIKE operators
  • Tests for Window call operators with and without frames
  • Tests for aggregate functions
  • Tests where expressions are nested multiple levels to verify recursive simplification works correctly. Example: ABS(ADD(COUNT(customers), DEFAULT_TO(None, 5))).
  • Empty inputs, zero-length strings in string functions.
  • DEFAULT_TO with more than two arguments.
  • Boolean operators with None or unexpected values.

Copy link
Contributor Author

@knassre-bodo knassre-bodo Jul 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests for ADD, MUL, DIV simplification outside ABS and DEFAULT_TO

Sure thing

Tests for COUNT(*), COUNT(non-null column), and their positive/non-null predicates outside ABS and DEFAULT_TO

There isn't really a way to test their positive/non-null predicates w/o ABS / DEFAULT_TO (or similarly sensitive operators like boolean operators). The point of those two is that they only get simplified correctly if the predicates to their arguments are correct.

Logical operators with mixed literal and non-literal inputs

We have some of these already

Tests where expressions are nested multiple levels to verify recursive simplification works correctly

We have some of those already: MONOTONIC(1, LENGTH('foo'), COUNT(customers)), ABS(DEFAULT_TO(AVG(ABS(DEFAULT_TO(LENGTH(customers.name), 0))), 0))

Copy link
Contributor Author

@knassre-bodo knassre-bodo Jul 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I now got most of the ones you suggested in some form.

Copy link
Contributor

@hadia206 hadia206 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Kian

Copy link
Contributor

@john-sanchez31 john-sanchez31 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine

@knassre-bodo knassre-bodo merged commit d7612bc into main Aug 1, 2025
7 checks passed
@knassre-bodo knassre-bodo deleted the kian/simplify branch August 1, 2025 15:06
knassre-bodo added a commit that referenced this pull request Aug 7, 2025
Extension of #396 to further the transformation of LEFT joins into INNER joins when there are filters on top of the join if the filters would be false if the RHS columns were all-null (relies on simplification to deduce if this is the case). Also creates a new `RelationalShuttle` base class so relational nodes can use the shuttle design pattern, and refactors filter pushdown to use this pattern. Also refactors several other utilities to use shuttles that should have been doing so already.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants