Add support for sliding window function frames #346

knassre-bodo · 2025-05-19T17:59:31Z

Resolves #345. See issue for more details. Additional changes:

Changed the SQLGlot generation for intermediary SELECT clauses to sort the columns by the column alias (falling back on the entire column string if there isn't one) instead of just using the column string. This results in SQL that is sometimes easier to read when comparing to the original PyDough code.
Minor change to how the SQLGlot AST is generated for the RelationalRoot to avoid an extra SELECT clause just to add a redundant ORDER BY on top of a sub-query that already sorts by the same values.
Fixed the naming of the generated plan files for the EPOCH dataset to include the epoch_ prefix

…enamed epoch planner files

knassre-bodo · 2025-05-19T18:00:07Z

pydough/sqlglot/execute_relational.py

+    try:
+        glot_expr = apply_sqlglot_optimizer(glot_expr, relational, sqlglot_dialect)
+    except SqlglotError as e:
+        print(
+            f"ERROR WHILE OPTIMIZING QUERY:\n{glot_expr.sql(sqlglot_dialect, pretty=True)}"
+        )
+        raise e


This is just for aiding in debugging. We have this exact same pattern elsewhere (for the execution of the query) just copying here so we can see the entire query when this error pops up.

knassre-bodo · 2025-05-19T18:01:09Z

pydough/sqlglot/sqlglot_relational_visitor.py

+def expr_sort_key(expr: SQLGlotExpression) -> str:
+    """
+    A key function for sorting SQLGlot expressions. This is used to
+    ensure that the expressions are sorted in a consistent order.
+
+    Args:
+        expr (SQLGlotExpression): The expression to sort.
+
+    Returns:
+        str: The string representation of the expression.
+    """
+    if isinstance(expr, SQLGlotAlias):
+        return repr(expr.alias)
+    else:
+        return repr(expr)


Cosmetic effect: outside of the final SELECT clause, all other SELECT clauses are ordered alphabetically by alias (if they have one) instead of just by the entire expression's string representation. This makes it sometimes easier to read the SQL files and verify each clause in the select is what it should be, esp. for complicated things like window functions.

knassre-bodo · 2025-05-19T18:02:34Z

tests/test_sql_refsols/tpch_q5_sqlite.sql

 WITH _t1 AS (
  SELECT
-    MAX(nation.n_name) AS agg_3,
    SUM(lineitem.l_extendedprice * (
      1 - lineitem.l_discount
-    )) AS agg_0
+    )) AS agg_0,
+    MAX(nation.n_name) AS agg_3


Example of the cosmetic change: before it was sorting by the string MAX(...) AS agg_3 vs SUM(...) AS agg_0, now it is sorting by agg_3 vs agg_0

knassre-bodo · 2025-05-19T18:07:53Z

pydough/sqlglot/sqlglot_relational_visitor.py

+            # The columns used for ordering must only be noted as dependencies
+            # if the root has an ordering but the input expression does not.
+            ordering_dependencies: set[Identifier] = set()
+            if len(ordering_exprs) > 0 and "order" not in input_expr.args:
+                ordering_dependencies = find_identifiers_in_list(ordering_exprs)
            query = self._merge_selects(
-                exprs, input_expr, find_identifiers_in_list(ordering_exprs), sort=False
+                exprs, input_expr, ordering_dependencies, sort=False


This change makes it so we no longer have final queries like this:

with _t0 AS ( SELECT ... FROM ... ORDER BY x, y, z LIMIT 5 ) SELECT .. FROM _t0 ORDER BY x, y, z

Since they can be replaced with:

SELECT ... FROM ... ORDER BY x, y, z LIMIT 5

This method _merge_selects already performs the compaction discussed here, but not if any of the columns in the order by clause are NOT in the final select clause (since it is bad form to do select a from ... order by c). The third arguments to _merge_selects is used to pass in any columns like this.

The change here is we no longer pass in the columns used by the order by unless there is an order by in the root but NOT in the input. The effect here is if we get the pattern of double-orderby, we can now delete the 2nd one bc it is redundant (we don't have to worry about the order bys being different because that is what _is_mergeable_ordering checks for)

I would like to see that explanation in the code as a comment but in a shorter version somehow

I can expand on the comment on line 565-566 to say the intent.

knassre-bodo · 2025-05-19T18:08:23Z

tests/test_pipeline_epoch.py

+    unqualified_impl, test_name, _ = pydough_pipeline_test_data_epoch
+    file_name: str = f"epoch_{test_name}"


This change was made for consistency & clarity; now all of the generated files for the epoch tests have the same naming scheme.

knassre-bodo · 2025-05-19T18:09:22Z

tests/test_sql_refsols/epoch_overlapping_event_searches_per_user_sqlite.sql

-  ORDER BY
-    n_searches DESC,
-    agg_2
-  LIMIT 4
 )


This is an example of the cosmetic optimization created by the change to the call to _merge_selects

knassre-bodo · 2025-05-19T18:09:46Z

tests/test_sql_refsols/epoch_overlapping_event_searches_per_user_ansi.sql

-), _t0 AS (
-  SELECT
-    agg_2,
-    n_searches,
-    user_name
-  FROM _t1
-  ORDER BY
-    n_searches DESC,
-    agg_2
-  LIMIT 4


This is another example of the cosmetic optimization created by the change to the call to _merge_selects

knassre-bodo · 2025-05-19T18:11:42Z

tests/test_sql_refsols/window_sliding_frame_relsize_ansi.sql

+ORDER BY
+  sbtransaction.sbtxdatetime
+LIMIT 8


Before the two cosmetic changes made in this PR, this SQL file was extremely nasty to read for two reasons:

It had the same pattern of CTE with limit folowed by the redundant SELECT ... FROM ... ORDER BY

Within the CTE, the columns were in a crazy order because they were being sorted lexigraphically by the entire string, which meant that whatever was in the OVER (...) was being used to sort, instead of w1 vs w2 vs ...

Now, in this version, it is much easier to look at each column in this select clause and compare it 1:1 with the expression in the PyDough code.

That's cool

knassre-bodo · 2025-05-19T18:12:19Z

tests/test_sql_refsols/epoch_culture_events_info_ansi.sql

+  events.ev_dt
+LIMIT 6


This is another example of the cosmetic optimization created by the change to the call to _merge_selects

knassre-bodo · 2025-05-19T18:16:24Z

tests/test_sql_refsols/epoch_overlapping_event_search_other_users_per_user_ansi.sql

-), _t0 AS (
-  SELECT
-    agg_2,
-    n_other_users,
-    user_name
-  FROM _t1
-  ORDER BY
-    n_other_users DESC,
-    agg_2
-  LIMIT 7


This is another example of the cosmetic optimization created by the change to the call to _merge_selects

knassre-bodo · 2025-05-19T18:22:36Z

pydough/sqlglot/sqlglot_relational_expression_visitor.py

+        Args:
+            kwargs (dict[str, object]): The keyword arguments to parse, which
+            may include a `frame` argument or a `cumulative` argument. It is
+            assumed the keyword arguments are the correct types/formats.


This assumption is all of the checks in unqualified_node.py (verified in the new bad_window tests in test_unqualified_node.py

hadia206

Overall looks good to me.

documentation/functions.md

hadia206 · 2025-05-21T23:40:03Z

pydough/sqlglot/sqlglot_relational_expression_visitor.py

+        lower: int | None = None
+        upper: int | None = None
+        if kwargs.get("cumulative", False):
+            # If cumulative=True, that is the same as the frame (None, 0)


Shouldn't the comment be If cumulative=False? or is the if statement should be "cumulative", True?

The comment is correct, since this is what we do if kwargs contains cumulaative and its value is True. We use False as the default so that "cumulative isn't in the kwargs" has the same effect as "cumulative is in the kwargs but its value is False".

hadia206 · 2025-05-21T23:49:51Z

documentation/functions.md

- `cumulative` (optional): optional argument (default `False`) that can only be `True` if the `by` argument is provided. If `True`, then instead of returning the sum of all of the data in the context, returns the cumulative sum of all rows up to and including the current row when sorted according to the keys in the `by` argument.
+- `by` (optional): 1+ collation values, either as a single expression or an iterable of expressions, used to order the records of the current context. Can only be provided if `cumulative` is True or `frame` is provided.
+- `cumulative` (optional): optional argument (default `False`) that can only be `True` if the `by` argument is provided. If `True`, then instead of returning the sum of all of the data in the context, returns the cumulative sum of all rows up to and including the current row when sorted according to the keys in the `by` argument. This argument cannot be provided if `frame` is also provided.
+- `frame` (optional): optional argument that, if provided, must be a tuple of two values `(lower, upper)` that are each either integer literals or None. If provided, it means the relative sum is performed on a subset of the selected records based on the values of `lower` and `upper` when sorted according to the keys in the `by` argument:


We need to mention that if frame is provided, by must be provided as well.

tests/bad_pydough_functions.py

hadia206 · 2025-05-21T23:55:20Z

tests/test_logging.py

    SUM(l_extendedprice * (
      1 - l_discount
    ) * (
      1 + l_tax
    )) AS sum_charge,
+    SUM(l_discount) AS sum_discount,


Why these changes?

The same alphabetization of column order outside of the final select.

hadia206 · 2025-05-21T23:56:20Z

tests/test_sql_refsols/defog_broker_adv11_ansi.sql

@@ -1,7 +1,7 @@
 WITH _t1 AS (
  SELECT
-    sbcustomer.sbcustemail AS email,
-    sbcustomer.sbcustid AS _id
+    sbcustomer.sbcustid AS _id,


Why this change?

The same alphabetization of column order outside of the final select.

hadia206 · 2025-05-21T23:58:52Z

tests/test_sql_refsols/window_sliding_frame_relsize_ansi.sql

+ORDER BY
+  sbtransaction.sbtxdatetime
+LIMIT 8


That's cool

Hadia's revisions Co-authored-by: Hadia Ahmed <[email protected]>

Added sliding frame support, updated SQL plan generation cosmetics, r…

a0645ac

…enamed epoch planner files

knassre-bodo commented May 19, 2025

View reviewed changes

knassre-bodo added 2 commits May 19, 2025 14:20

Cleanup, comments, and additional error-handling tests

ef14335

Fixing docstring

d65b91c

knassre-bodo commented May 19, 2025

View reviewed changes

Adding documentation

8d560c9

knassre-bodo requested a review from hadia206 May 19, 2025 23:45

[RUN CI]

7cfd96f

knassre-bodo marked this pull request as ready for review May 19, 2025 23:46

Fixing test error [RUN CI]

9dc6734

hadia206 approved these changes May 21, 2025

View reviewed changes

knassre-bodo and others added 3 commits May 21, 2025 22:17

Apply suggestions from code review

9f3d88b

Hadia's revisions Co-authored-by: Hadia Ahmed <[email protected]>

Final revisions [RUN CI]'

31e9298

Final revision [RUN CI]

59c6191

knassre-bodo merged commit 54667ba into main May 22, 2025
5 checks passed

knassre-bodo deleted the kian/sliding_window branch May 22, 2025 07:07

		unqualified_impl, test_name, _ = pydough_pipeline_test_data_epoch
		file_name: str = f"epoch_{test_name}"

Add support for sliding window function frames #346

Add support for sliding window function frames #346

Uh oh!

Conversation

knassre-bodo commented May 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

knassre-bodo May 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

knassre-bodo May 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

knassre-bodo May 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hadia206 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

knassre-bodo commented May 19, 2025 •

edited

Loading

knassre-bodo May 19, 2025 •

edited

Loading

knassre-bodo May 19, 2025 •

edited

Loading

knassre-bodo May 19, 2025 •

edited

Loading