Skip to content

Add support for sliding window function frames #346

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
May 22, 2025

Conversation

knassre-bodo
Copy link
Contributor

@knassre-bodo knassre-bodo commented May 19, 2025

Resolves #345. See issue for more details. Additional changes:

  • Changed the SQLGlot generation for intermediary SELECT clauses to sort the columns by the column alias (falling back on the entire column string if there isn't one) instead of just using the column string. This results in SQL that is sometimes easier to read when comparing to the original PyDough code.
  • Minor change to how the SQLGlot AST is generated for the RelationalRoot to avoid an extra SELECT clause just to add a redundant ORDER BY on top of a sub-query that already sorts by the same values.
  • Fixed the naming of the generated plan files for the EPOCH dataset to include the epoch_ prefix

Comment on lines +64 to +70
try:
glot_expr = apply_sqlglot_optimizer(glot_expr, relational, sqlglot_dialect)
except SqlglotError as e:
print(
f"ERROR WHILE OPTIMIZING QUERY:\n{glot_expr.sql(sqlglot_dialect, pretty=True)}"
)
raise e
Copy link
Contributor Author

@knassre-bodo knassre-bodo May 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just for aiding in debugging. We have this exact same pattern elsewhere (for the execution of the query) just copying here so we can see the entire query when this error pops up.

Comment on lines +48 to +62
def expr_sort_key(expr: SQLGlotExpression) -> str:
"""
A key function for sorting SQLGlot expressions. This is used to
ensure that the expressions are sorted in a consistent order.

Args:
expr (SQLGlotExpression): The expression to sort.

Returns:
str: The string representation of the expression.
"""
if isinstance(expr, SQLGlotAlias):
return repr(expr.alias)
else:
return repr(expr)
Copy link
Contributor Author

@knassre-bodo knassre-bodo May 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cosmetic effect: outside of the final SELECT clause, all other SELECT clauses are ordered alphabetically by alias (if they have one) instead of just by the entire expression's string representation. This makes it sometimes easier to read the SQL files and verify each clause in the select is what it should be, esp. for complicated things like window functions.

Comment on lines 1 to +6
WITH _t1 AS (
SELECT
MAX(nation.n_name) AS agg_3,
SUM(lineitem.l_extendedprice * (
1 - lineitem.l_discount
)) AS agg_0
)) AS agg_0,
MAX(nation.n_name) AS agg_3
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example of the cosmetic change: before it was sorting by the string MAX(...) AS agg_3 vs SUM(...) AS agg_0, now it is sorting by agg_3 vs agg_0

Comment on lines 565 to 571
# The columns used for ordering must only be noted as dependencies
# if the root has an ordering but the input expression does not.
ordering_dependencies: set[Identifier] = set()
if len(ordering_exprs) > 0 and "order" not in input_expr.args:
ordering_dependencies = find_identifiers_in_list(ordering_exprs)
query = self._merge_selects(
exprs, input_expr, find_identifiers_in_list(ordering_exprs), sort=False
exprs, input_expr, ordering_dependencies, sort=False
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change makes it so we no longer have final queries like this:

with _t0 AS (
SELECT ...
  FROM ...
  ORDER BY x, y, z
  LIMIT 5
)
SELECT ..
FROM _t0
ORDER BY x, y, z

Since they can be replaced with:

SELECT ...
FROM ...
ORDER BY x, y, z
LIMIT 5

This method _merge_selects already performs the compaction discussed here, but not if any of the columns in the order by clause are NOT in the final select clause (since it is bad form to do select a from ... order by c). The third arguments to _merge_selects is used to pass in any columns like this.

The change here is we no longer pass in the columns used by the order by unless there is an order by in the root but NOT in the input. The effect here is if we get the pattern of double-orderby, we can now delete the 2nd one bc it is redundant (we don't have to worry about the order bys being different because that is what _is_mergeable_ordering checks for)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to see that explanation in the code as a comment but in a shorter version somehow

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can expand on the comment on line 565-566 to say the intent.

Comment on lines +403 to +404
unqualified_impl, test_name, _ = pydough_pipeline_test_data_epoch
file_name: str = f"epoch_{test_name}"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change was made for consistency & clarity; now all of the generated files for the epoch tests have the same naming scheme.

ORDER BY
n_searches DESC,
agg_2
LIMIT 4
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an example of the cosmetic optimization created by the change to the call to _merge_selects

Comment on lines -29 to -38
), _t0 AS (
SELECT
agg_2,
n_searches,
user_name
FROM _t1
ORDER BY
n_searches DESC,
agg_2
LIMIT 4
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another example of the cosmetic optimization created by the change to the call to _merge_selects

Comment on lines +14 to +16
ORDER BY
sbtransaction.sbtxdatetime
LIMIT 8
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before the two cosmetic changes made in this PR, this SQL file was extremely nasty to read for two reasons:

  • It had the same pattern of CTE with limit folowed by the redundant SELECT ... FROM ... ORDER BY
  • Within the CTE, the columns were in a crazy order because they were being sorted lexigraphically by the entire string, which meant that whatever was in the OVER (...) was being used to sort, instead of w1 vs w2 vs ...

Now, in this version, it is much easier to look at each column in this select clause and compare it 1:1 with the expression in the PyDough code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's cool

Comment on lines +21 to +22
events.ev_dt
LIMIT 6
Copy link
Contributor Author

@knassre-bodo knassre-bodo May 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another example of the cosmetic optimization created by the change to the call to _merge_selects

Comment on lines -22 to -31
), _t0 AS (
SELECT
agg_2,
n_other_users,
user_name
FROM _t1
ORDER BY
n_other_users DESC,
agg_2
LIMIT 7
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another example of the cosmetic optimization created by the change to the call to _merge_selects

Comment on lines +91 to +94
Args:
kwargs (dict[str, object]): The keyword arguments to parse, which
may include a `frame` argument or a `cumulative` argument. It is
assumed the keyword arguments are the correct types/formats.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assumption is all of the checks in unqualified_node.py (verified in the new bad_window tests in test_unqualified_node.py

@knassre-bodo knassre-bodo requested a review from hadia206 May 19, 2025 23:45
@knassre-bodo knassre-bodo marked this pull request as ready for review May 19, 2025 23:46
Copy link
Contributor

@hadia206 hadia206 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me.

lower: int | None = None
upper: int | None = None
if kwargs.get("cumulative", False):
# If cumulative=True, that is the same as the frame (None, 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the comment be If cumulative=False? or is the if statement should be "cumulative", True?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment is correct, since this is what we do if kwargs contains cumulaative and its value is True. We use False as the default so that "cumulative isn't in the kwargs" has the same effect as "cumulative is in the kwargs but its value is False".

- `cumulative` (optional): optional argument (default `False`) that can only be `True` if the `by` argument is provided. If `True`, then instead of returning the sum of all of the data in the context, returns the cumulative sum of all rows up to and including the current row when sorted according to the keys in the `by` argument.
- `by` (optional): 1+ collation values, either as a single expression or an iterable of expressions, used to order the records of the current context. Can only be provided if `cumulative` is True or `frame` is provided.
- `cumulative` (optional): optional argument (default `False`) that can only be `True` if the `by` argument is provided. If `True`, then instead of returning the sum of all of the data in the context, returns the cumulative sum of all rows up to and including the current row when sorted according to the keys in the `by` argument. This argument cannot be provided if `frame` is also provided.
- `frame` (optional): optional argument that, if provided, must be a tuple of two values `(lower, upper)` that are each either integer literals or None. If provided, it means the relative sum is performed on a subset of the selected records based on the values of `lower` and `upper` when sorted according to the keys in the `by` argument:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to mention that if frame is provided, by must be provided as well.

SUM(l_extendedprice * (
1 - l_discount
) * (
1 + l_tax
)) AS sum_charge,
SUM(l_discount) AS sum_discount,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why these changes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same alphabetization of column order outside of the final select.

@@ -1,7 +1,7 @@
WITH _t1 AS (
SELECT
sbcustomer.sbcustemail AS email,
sbcustomer.sbcustid AS _id
sbcustomer.sbcustid AS _id,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same alphabetization of column order outside of the final select.

Comment on lines +14 to +16
ORDER BY
sbtransaction.sbtxdatetime
LIMIT 8
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's cool

@knassre-bodo knassre-bodo merged commit 54667ba into main May 22, 2025
5 checks passed
@knassre-bodo knassre-bodo deleted the kian/sliding_window branch May 22, 2025 07:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add window function frame feature for relsum/relavg/relcount/relsize
2 participants