-
Notifications
You must be signed in to change notification settings - Fork 3
Add support for sliding window function frames #346
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…enamed epoch planner files
try: | ||
glot_expr = apply_sqlglot_optimizer(glot_expr, relational, sqlglot_dialect) | ||
except SqlglotError as e: | ||
print( | ||
f"ERROR WHILE OPTIMIZING QUERY:\n{glot_expr.sql(sqlglot_dialect, pretty=True)}" | ||
) | ||
raise e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just for aiding in debugging. We have this exact same pattern elsewhere (for the execution of the query) just copying here so we can see the entire query when this error pops up.
def expr_sort_key(expr: SQLGlotExpression) -> str: | ||
""" | ||
A key function for sorting SQLGlot expressions. This is used to | ||
ensure that the expressions are sorted in a consistent order. | ||
|
||
Args: | ||
expr (SQLGlotExpression): The expression to sort. | ||
|
||
Returns: | ||
str: The string representation of the expression. | ||
""" | ||
if isinstance(expr, SQLGlotAlias): | ||
return repr(expr.alias) | ||
else: | ||
return repr(expr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cosmetic effect: outside of the final SELECT
clause, all other SELECT
clauses are ordered alphabetically by alias (if they have one) instead of just by the entire expression's string representation. This makes it sometimes easier to read the SQL files and verify each clause in the select is what it should be, esp. for complicated things like window functions.
WITH _t1 AS ( | ||
SELECT | ||
MAX(nation.n_name) AS agg_3, | ||
SUM(lineitem.l_extendedprice * ( | ||
1 - lineitem.l_discount | ||
)) AS agg_0 | ||
)) AS agg_0, | ||
MAX(nation.n_name) AS agg_3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Example of the cosmetic change: before it was sorting by the string MAX(...) AS agg_3
vs SUM(...) AS agg_0
, now it is sorting by agg_3
vs agg_0
# The columns used for ordering must only be noted as dependencies | ||
# if the root has an ordering but the input expression does not. | ||
ordering_dependencies: set[Identifier] = set() | ||
if len(ordering_exprs) > 0 and "order" not in input_expr.args: | ||
ordering_dependencies = find_identifiers_in_list(ordering_exprs) | ||
query = self._merge_selects( | ||
exprs, input_expr, find_identifiers_in_list(ordering_exprs), sort=False | ||
exprs, input_expr, ordering_dependencies, sort=False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change makes it so we no longer have final queries like this:
with _t0 AS (
SELECT ...
FROM ...
ORDER BY x, y, z
LIMIT 5
)
SELECT ..
FROM _t0
ORDER BY x, y, z
Since they can be replaced with:
SELECT ...
FROM ...
ORDER BY x, y, z
LIMIT 5
This method _merge_selects
already performs the compaction discussed here, but not if any of the columns in the order by clause are NOT in the final select clause (since it is bad form to do select a from ... order by c
). The third arguments to _merge_selects
is used to pass in any columns like this.
The change here is we no longer pass in the columns used by the order by
unless there is an order by
in the root but NOT in the input. The effect here is if we get the pattern of double-orderby, we can now delete the 2nd one bc it is redundant (we don't have to worry about the order bys being different because that is what _is_mergeable_ordering
checks for)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to see that explanation in the code as a comment but in a shorter version somehow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can expand on the comment on line 565-566 to say the intent.
unqualified_impl, test_name, _ = pydough_pipeline_test_data_epoch | ||
file_name: str = f"epoch_{test_name}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change was made for consistency & clarity; now all of the generated files for the epoch tests have the same naming scheme.
ORDER BY | ||
n_searches DESC, | ||
agg_2 | ||
LIMIT 4 | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an example of the cosmetic optimization created by the change to the call to _merge_selects
), _t0 AS ( | ||
SELECT | ||
agg_2, | ||
n_searches, | ||
user_name | ||
FROM _t1 | ||
ORDER BY | ||
n_searches DESC, | ||
agg_2 | ||
LIMIT 4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is another example of the cosmetic optimization created by the change to the call to _merge_selects
ORDER BY | ||
sbtransaction.sbtxdatetime | ||
LIMIT 8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before the two cosmetic changes made in this PR, this SQL file was extremely nasty to read for two reasons:
- It had the same pattern of CTE with limit folowed by the redundant
SELECT ... FROM ... ORDER BY
- Within the CTE, the columns were in a crazy order because they were being sorted lexigraphically by the entire string, which meant that whatever was in the
OVER (...)
was being used to sort, instead ofw1
vsw2
vs ...
Now, in this version, it is much easier to look at each column in this select clause and compare it 1:1 with the expression in the PyDough code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's cool
events.ev_dt | ||
LIMIT 6 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is another example of the cosmetic optimization created by the change to the call to _merge_selects
), _t0 AS ( | ||
SELECT | ||
agg_2, | ||
n_other_users, | ||
user_name | ||
FROM _t1 | ||
ORDER BY | ||
n_other_users DESC, | ||
agg_2 | ||
LIMIT 7 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is another example of the cosmetic optimization created by the change to the call to _merge_selects
Args: | ||
kwargs (dict[str, object]): The keyword arguments to parse, which | ||
may include a `frame` argument or a `cumulative` argument. It is | ||
assumed the keyword arguments are the correct types/formats. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assumption is all of the checks in unqualified_node.py
(verified in the new bad_window
tests in test_unqualified_node.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good to me.
lower: int | None = None | ||
upper: int | None = None | ||
if kwargs.get("cumulative", False): | ||
# If cumulative=True, that is the same as the frame (None, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't the comment be If cumulative=False
? or is the if statement should be "cumulative", True
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment is correct, since this is what we do if kwargs
contains cumulaative
and its value is True. We use False
as the default so that "cumulative isn't in the kwargs" has the same effect as "cumulative is in the kwargs but its value is False".
documentation/functions.md
Outdated
- `cumulative` (optional): optional argument (default `False`) that can only be `True` if the `by` argument is provided. If `True`, then instead of returning the sum of all of the data in the context, returns the cumulative sum of all rows up to and including the current row when sorted according to the keys in the `by` argument. | ||
- `by` (optional): 1+ collation values, either as a single expression or an iterable of expressions, used to order the records of the current context. Can only be provided if `cumulative` is True or `frame` is provided. | ||
- `cumulative` (optional): optional argument (default `False`) that can only be `True` if the `by` argument is provided. If `True`, then instead of returning the sum of all of the data in the context, returns the cumulative sum of all rows up to and including the current row when sorted according to the keys in the `by` argument. This argument cannot be provided if `frame` is also provided. | ||
- `frame` (optional): optional argument that, if provided, must be a tuple of two values `(lower, upper)` that are each either integer literals or None. If provided, it means the relative sum is performed on a subset of the selected records based on the values of `lower` and `upper` when sorted according to the keys in the `by` argument: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to mention that if frame
is provided, by
must be provided as well.
SUM(l_extendedprice * ( | ||
1 - l_discount | ||
) * ( | ||
1 + l_tax | ||
)) AS sum_charge, | ||
SUM(l_discount) AS sum_discount, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why these changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same alphabetization of column order outside of the final select.
@@ -1,7 +1,7 @@ | |||
WITH _t1 AS ( | |||
SELECT | |||
sbcustomer.sbcustemail AS email, | |||
sbcustomer.sbcustid AS _id | |||
sbcustomer.sbcustid AS _id, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same alphabetization of column order outside of the final select.
ORDER BY | ||
sbtransaction.sbtxdatetime | ||
LIMIT 8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's cool
Hadia's revisions Co-authored-by: Hadia Ahmed <[email protected]>
Resolves #345. See issue for more details. Additional changes:
SELECT
clauses to sort the columns by the column alias (falling back on the entire column string if there isn't one) instead of just using the column string. This results in SQL that is sometimes easier to read when comparing to the original PyDough code.RelationalRoot
to avoid an extraSELECT
clause just to add a redundantORDER BY
on top of a sub-query that already sorts by the same values.epoch_
prefix