-
Notifications
You must be signed in to change notification settings - Fork 3
Add QUANTILE function to PyDough #378
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job @john-sanchez31, just a few comments and then it will be ready to merge
|
||
> [!NOTE] | ||
> `QUANTILE(X, 0.5)` is equivalent to `MEDIAN(X)`. | ||
> The implementation uses the SQL standard `PERCENTILE_DISC` aggregate function where available. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better phrasing: it is equivalent to the common PERCENTILE_DISC
SQL aggregation function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
documentation/functions.md
Outdated
- If the quantile argument is not a valid number between 0 and 1, an error is raised. | ||
|
||
> [!NOTE] | ||
> `QUANTILE(X, 0.5)` is equivalent to `MEDIAN(X)`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is incorrect, it isn't actually equivalent to MEDIAN
it is just very similar to MEDIAN
. The difference is that MEDIAN
will, if there is an even number of rows, take the average of the two median rows, but QUANTILE
(aka PERCENTILE_DISC
) always selects a row instead of interpolating between rows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made the corrections
documentation/functions.md
Outdated
# Returns the value at the 90th percentile of supply costs for each supplier | ||
Suppliers.CALCULATE(ninetieth_percentile_cost = QUANTILE(supply_records.supply_cost, 0.9)) | ||
|
||
# Returns the median supply cost for each supplier (equivalent to MEDIAN) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again don't say equivalent median, say the 50th percentile
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed as well
documentation/functions.md
Outdated
|
||
### QUANTILE | ||
|
||
The `QUANTILE` function returns the value at a specified quantile from the set of values it is called on. The quantile value `p` must be a numeric literal between 0 and 1 (inclusive), where `0` returns the minimum, `1` returns the maximum, and `0.5` returns the median. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to be very specific about what kind of quantile you are talking about here, since there are many different kinds of quantile computations in math/analytics. Specifically:
- When doing
QUANTILE(x, p)
, returns the smallest value ofx
such thatp
of the rows are less than or equal to it (this is whatPERCENTILE_DISC
is) - Ignores
NULL
records in computation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
Rewrites a QUANTILE aggregation call into an equivalent expression using window functions. | ||
This is typically used for dialects that do not natively support the PERCENTILE_DISC | ||
aggregate function. | ||
|
||
The rewritten expression selects the value at the specified quantile by: | ||
- Ranking the rows within each partition. | ||
- Calculating the number of rows (N) in each partition. | ||
- Keeping only those rows where the rank is greater than INTEGER((1.0 - p) * N), | ||
where p is the quantile argument. | ||
- Taking the maximum value among the kept rows. | ||
|
||
Args: | ||
child_connection: The HybridConnection containing the aggregate call to QUANTILE. | ||
expr: The HybridFunctionExpr representing the QUANTILE aggregation. | ||
create_new_calc: If True, injects new expressions into a new CALCULATE operation. | ||
|
||
Returns: | ||
A HybridFunctionExpr representing the rewritten aggregation using window functions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sure none of these lines are over 80 characters
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
) | ||
|
||
# (1.0-args[1]) | ||
sub: HybridExpr = HybridFunctionExpr(pydop.SUB, [one, p], NumericType()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we know p
is a numeric literal, we can forgo this expr entirely and just pass in the new literal:
sub: HybridExpr = HybridFunctionExpr(pydop.SUB, [one, p], NumericType()) | |
sub: HybridExpr = HybridLiteralExpr(Literal(1.0 - p.literal.value, NumericType())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed. It changed the sql, just make sure everything still looks fine
Converts a PyDough QUANTILE(X, p) function call to a SQLGlot expression | ||
representing the SQL standard PERCENTILE_DISC aggregate function. | ||
|
||
This produces an expression equivalent to: | ||
PERCENTILE_DISC(p) WITHIN GROUP (ORDER BY X) | ||
|
||
Args: | ||
args: A list of two SQLGlot expressions, where args[0] is the column or | ||
expression to order by (X), and args[1] is the quantile value (p) between 0 and 1. | ||
types: The PyDough types of the arguments. | ||
|
||
Returns: | ||
A SQLGlotExpression representing the PERCENTILE_DISC(p) WITHIN GROUP (ORDER BY X) | ||
aggregate function. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
80 character lines
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
if ( | ||
not isinstance(args[1], sqlglot_expressions.Literal) | ||
or args[1].is_string | ||
# or not isinstance(args[1].this, (int, float)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either fix or delete this, but don't leave the comment as dead code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deleted
|
||
assert len(args) == 2 | ||
|
||
# validation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment should be clearer: # Validate that the second argument is a numeric literal between 0 and 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments remaining while you address the testing
Rewrites a QUANTILE aggregation call into an equivalent expression using | ||
window functions. | ||
This is typically used for dialects that do not natively support the | ||
PERCENTILE_DISC | ||
aggregate function. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The newlines here are a bit off. Also, wrap functions like QUANTILE
or PERCENTILE_DISC
in backticks (`) for tooltip formatting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please address this before merging
WITH _s0 AS ( | ||
SELECT | ||
n_name, | ||
n_nationkey, | ||
n_regionkey | ||
FROM tpch.nation | ||
ORDER BY |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't forget to delete the test_3
and test_4
SQL files since they are no longer part of test_pydough_to_sql.py
.
tests/test_pipeline_tpch_custom.py
Outdated
PyDoughPandasTest( | ||
quantile_function_test_1, | ||
"TPCH", | ||
# Answer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to write # Answer
here every time
Rewrites a QUANTILE aggregation call into an equivalent expression using | ||
window functions. | ||
This is typically used for dialects that do not natively support the | ||
PERCENTILE_DISC | ||
aggregate function. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please address this before merging
Resolves issue #373