-
Notifications
You must be signed in to change notification settings - Fork 3
Add relational expression simplification with predicate inference #396
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… performant with fewer window functions [RUN CI]
@@ -1531,6 +1540,9 @@ def convert_ast_to_relational( | |||
raw_result: RelationalRoot = postprocess_root(node, columns, hybrid, output) | |||
|
|||
# Invoke the optimization procedures on the result to clean up the tree. | |||
optimized_result: RelationalRoot = optimize_relational_tree(raw_result, configs) | |||
additional_shuttles: list[RelationalExpressionShuttle] = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This list will be used down the lines for when additional configurations are set that allow more specific/advanced transformations to happen during simplification.
" s03 = DEFAULT_TO(None, 0) > 0," # -> False | ||
" s04 = DEFAULT_TO(None, 0) <= 0," # -> True | ||
" s05 = DEFAULT_TO(None, 0) < 0," # -> False | ||
" s06 = DEFAULT_TO(None, 0) == None," # -> None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For tests from s06
to s11
shouldn't we expect True/False values ? Why are they None?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
x == NULL
is always NULL, the same goes for other operators. You can test it in SQL. Some dialects have "null safe" variants (for example, in Snowflake you can do EQUAL_NULL(1, NULL)
which is False, or EQUAL_NULL(NULL, NULL)
which is True)
tests/test_pipeline_defog_custom.py
Outdated
" s03 = MONOTONIC(1, 4, 3)," # -> False | ||
" s04 = MONOTONIC(1, 2, 1)," # -> False | ||
" s05 = MONOTONIC(1, 0, 1)," # -> False | ||
" s06 = MONOTONIC(1, LENGTH('foo'), COUNT(customers))," # -> 3 <= COUNT(customers) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the comment of s06
and s08
comments accurate? Taking in count that below we are expecting 1
and '0' respectively
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comments indicate what they get simplified to, not what they evaluate to. The simplified expression still needs to get evaluated. In this case, s06
simplifies to MONOTONIC(1, 3, COUNT(customers))
, which is equivalent to (1 <= 3) & (3 <= COUNT(customers)
, which simplifies to True & (3 <= COUNT(customers))
, which just becomes 3 <= COUNT(customers)
.
Also there is no True/False in sqlite, it just returns 0/1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall code seems to do what the PR description is saying.
Please see my comments below, I'm mainly looking for some more tests.
if len(predicates) == 0: | ||
return result | ||
else: | ||
result |= predicates[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be result = predicates[0]
?
My understanding is that result |= predicates[0]
will be a no-op since result is the default PredicateSet(), and |
will be a union
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because I don't want the intersect
operator to actually mutate any of the input sets (since everything is mutable). If I do result |= predicates[0]
then in the next section do result &= pred
, this will mutate both result and predicates[0]
, which is a problem if predicates[0]
is being used elsewhere.
""" | ||
result: PredicateSet = PredicateSet() | ||
if len(predicates) == 0: | ||
return result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does that man both union and intersection with empty set will have the same result value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is saying if we call it on an empty list of sets, the output assumption is that we know nothing.
output_predicates: PredicateSet = PredicateSet() | ||
if literal_expression.value is not None: | ||
output_predicates.not_null = True | ||
if isinstance(literal_expression.value, (int, float)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about booleans? 0/1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point
if literal_expression.value >= 0: | ||
output_predicates.not_negative = True | ||
if literal_expression.value > 0: | ||
output_predicates.positive = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we have a default or fallback for other types? (string, boolean, dates, ...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We start with the default predicate set (we know nothing to be true) and if the value happens to be in these cases then we add predicates to it. The default is we don't enter these cases so we just return the empty predicate set.
"simplification_3", | ||
), | ||
id="simplification_3", | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other tests to include
- Tests for ADD, MUL, DIV simplification outside ABS and DEFAULT_TO
- Test multiplication by negative or zero explicitly to confirm positive/not-negative predicates.
- Tests for COUNT(*), COUNT(non-null column), and their positive/non-null predicates outside ABS and DEFAULT_TO
- Logical operators with mixed literal and non-literal inputs
- XOR (BXR) and LIKE operators
- Tests for Window call operators with and without frames
- Tests for aggregate functions
- Tests where expressions are nested multiple levels to verify recursive simplification works correctly. Example:
ABS(ADD(COUNT(customers), DEFAULT_TO(None, 5))).
- Empty inputs, zero-length strings in string functions.
DEFAULT_TO
with more than two arguments.- Boolean operators with
None
or unexpected values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests for ADD, MUL, DIV simplification outside ABS and DEFAULT_TO
Sure thing
Tests for COUNT(*), COUNT(non-null column), and their positive/non-null predicates outside ABS and DEFAULT_TO
There isn't really a way to test their positive/non-null predicates w/o ABS
/ DEFAULT_TO
(or similarly sensitive operators like boolean operators). The point of those two is that they only get simplified correctly if the predicates to their arguments are correct.
Logical operators with mixed literal and non-literal inputs
We have some of these already
Tests where expressions are nested multiple levels to verify recursive simplification works correctly
We have some of those already: MONOTONIC(1, LENGTH('foo'), COUNT(customers))
, ABS(DEFAULT_TO(AVG(ABS(DEFAULT_TO(LENGTH(customers.name), 0))), 0))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I now got most of the ones you suggested in some form.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Kian
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine
Extension of #396 to further the transformation of LEFT joins into INNER joins when there are filters on top of the join if the filters would be false if the RHS columns were all-null (relies on simplification to deduce if this is the case). Also creates a new `RelationalShuttle` base class so relational nodes can use the shuttle design pattern, and refactors filter pushdown to use this pattern. Also refactors several other utilities to use shuttles that should have been doing so already.
Adds a simplification pass in the relational optimizer that simplifies relational expressions using predicate inference & constant folding. Built via a relational visitor which modifies the relational nodes in place by invoking a relational shuttle that transforms the expressions, both of which propagate information via
PredicateSet
objects which contain information about whether an expression matches certain criteria (positive, non-negative, non-null).Also updates the testing infrastructure in several ways:
PyDoughPandasTest
can now take in a string instead of a function (evaluated viapydough.from_string
).args
argument toPyDoughPandasTest
is replaced withkwargs
, passed in as keyword arguments to functions or as an environment topydough.from_string
.