Add QUANTILE function to PyDough #378

Merged on Jul 3, 2025 (18 commits)
john-sanchez31
Contributor

Resolves issue #373

@john-sanchez31 self-assigned this on Jun 20, 2025
@john-sanchez31 added the documentation, extensibility, and effort - medium labels on Jun 20, 2025
@john-sanchez31 linked an issue on Jun 20, 2025 that may be closed by this pull request
@john-sanchez31 marked this pull request as ready for review on June 25, 2025 22:03
Contributor

@knassre-bodo left a comment

Great job @john-sanchez31, just a few comments and then it will be ready to merge


> [!NOTE]
> `QUANTILE(X, 0.5)` is equivalent to `MEDIAN(X)`.
> The implementation uses the SQL standard `PERCENTILE_DISC` aggregate function where available.
Contributor

Better phrasing: it is equivalent to the common PERCENTILE_DISC SQL aggregation function.

Contributor Author

Done

- If the quantile argument is not a valid number between 0 and 1, an error is raised.

> [!NOTE]
> `QUANTILE(X, 0.5)` is equivalent to `MEDIAN(X)`.
Contributor

This is incorrect: it isn't actually equivalent to `MEDIAN`, just very similar to it. The difference is that, when there is an even number of rows, `MEDIAN` takes the average of the two middle rows, whereas `QUANTILE` (aka `PERCENTILE_DISC`) always selects an existing row instead of interpolating between rows.

Contributor Author

I've made the corrections
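
To make the distinction in this thread concrete, here is a small sketch that uses only Python's standard library, not PyDough. It assumes `statistics.median_low` is a fair stand-in for the row-selecting behavior at the 50th percentile of an even-sized group; the cost values are made up for illustration.

```python
import statistics

# Hypothetical supply costs for one supplier (an even number of rows).
costs = [10.0, 20.0, 30.0, 40.0]

# MEDIAN-style result: averages the two middle rows.
interpolated = statistics.median(costs)

# PERCENTILE_DISC(0.5)-style result: picks an existing row (the lower
# of the two middle rows) instead of interpolating between them.
selected = statistics.median_low(costs)

print(interpolated, selected)
```

With an odd number of rows the two results coincide, which is why the two functions are easy to mistake for equivalents.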

# Returns the value at the 90th percentile of supply costs for each supplier
Suppliers.CALCULATE(ninetieth_percentile_cost = QUANTILE(supply_records.supply_cost, 0.9))

# Returns the median supply cost for each supplier (equivalent to MEDIAN)
Contributor

Again, don't say it is equivalent to the median; say it returns the 50th percentile.

Contributor Author

Fixed as well


### QUANTILE

The `QUANTILE` function returns the value at a specified quantile from the set of values it is called on. The quantile value `p` must be a numeric literal between 0 and 1 (inclusive), where `0` returns the minimum, `1` returns the maximum, and `0.5` returns the median.
Contributor

You need to be very specific about which kind of quantile this is, since there are many different quantile computations in math/analytics. Specifically:

  • `QUANTILE(x, p)` returns the smallest value of `x` such that at least a fraction `p` of the rows are less than or equal to it (this is what `PERCENTILE_DISC` does).
  • NULL records are ignored in the computation.

Contributor Author

Fixed
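
The definition requested in this thread can be sketched in plain Python as follows. This is an illustration of the discrete-quantile semantics described by the reviewer, not PyDough's implementation; the function name `quantile_disc` and the use of `None` for NULL are assumptions of the sketch.

```python
import math
from typing import Optional, Sequence


def quantile_disc(
    values: Sequence[Optional[float]], p: float
) -> Optional[float]:
    """Smallest non-NULL value whose cumulative share of rows is >= p."""
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1 (inclusive)")
    # NULL (None) records are ignored entirely.
    ordered = sorted(v for v in values if v is not None)
    if not ordered:
        return None
    # 1-based position of the first row that covers a fraction p of the
    # rows; clamped to 1 so that p = 0 yields the minimum.
    position = max(1, math.ceil(p * len(ordered)))
    return ordered[position - 1]


print(quantile_disc([3, None, 1, 4, 2], 0.5))
```

Under this definition the result is always one of the input values, `p = 0` gives the minimum, and `p = 1` gives the maximum.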

Comment on lines 778 to 795
Rewrites a QUANTILE aggregation call into an equivalent expression using window functions.
This is typically used for dialects that do not natively support the PERCENTILE_DISC
aggregate function.

The rewritten expression selects the value at the specified quantile by:
- Ranking the rows within each partition.
- Calculating the number of rows (N) in each partition.
- Keeping only those rows where the rank is greater than INTEGER((1.0 - p) * N),
where p is the quantile argument.
- Taking the maximum value among the kept rows.

Args:
child_connection: The HybridConnection containing the aggregate call to QUANTILE.
expr: The HybridFunctionExpr representing the QUANTILE aggregation.
create_new_calc: If True, injects new expressions into a new CALCULATE operation.

Returns:
A HybridFunctionExpr representing the rewritten aggregation using window functions.
Contributor

Make sure none of these lines are over 80 characters

Contributor Author

Done
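
The rewrite steps listed in the docstring above can be sketched in plain Python. The docstring does not state the ranking direction or how ties are ranked, so this sketch assumes rows are ranked from largest to smallest with row-number-style ranks; the function name `quantile_via_ranking` is hypothetical and the code is not taken from the PR.

```python
from typing import Optional, Sequence


def quantile_via_ranking(
    values: Sequence[float], p: float
) -> Optional[float]:
    # Rank rows from largest (rank 1) to smallest (rank N). The
    # descending order and row-number-style ranks are assumptions.
    ranked = list(enumerate(sorted(values, reverse=True), start=1))
    n = len(ranked)
    # INTEGER((1.0 - p) * N): truncate toward zero.
    cutoff = int((1.0 - p) * n)
    # Keep rows whose rank exceeds the cutoff, then take the maximum.
    kept = [value for rank, value in ranked if rank > cutoff]
    return max(kept) if kept else None


print(quantile_via_ranking([10, 20, 30, 40], 0.5))
```

Because the cutoff is computed in floating point, values of `p` that are not exactly representable can land on either side of an integer boundary; the sketch makes no attempt to correct for that.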

)

# (1.0-args[1])
sub: HybridExpr = HybridFunctionExpr(pydop.SUB, [one, p], NumericType())
Contributor

Since we know p is a numeric literal, we can forgo this expr entirely and just pass in the new literal:

Suggested change:
- sub: HybridExpr = HybridFunctionExpr(pydop.SUB, [one, p], NumericType())
+ sub: HybridExpr = HybridLiteralExpr(Literal(1.0 - p.literal.value, NumericType()))

Contributor Author

Changed. This altered the generated SQL, so please make sure everything still looks fine.

Comment on lines 1697 to 1710
Converts a PyDough QUANTILE(X, p) function call to a SQLGlot expression
representing the SQL standard PERCENTILE_DISC aggregate function.

This produces an expression equivalent to:
PERCENTILE_DISC(p) WITHIN GROUP (ORDER BY X)

Args:
args: A list of two SQLGlot expressions, where args[0] is the column or
expression to order by (X), and args[1] is the quantile value (p) between 0 and 1.
types: The PyDough types of the arguments.

Returns:
A SQLGlotExpression representing the PERCENTILE_DISC(p) WITHIN GROUP (ORDER BY X)
aggregate function.
Contributor

80 character lines

Contributor Author

Fixed

if (
    not isinstance(args[1], sqlglot_expressions.Literal)
    or args[1].is_string
    # or not isinstance(args[1].this, (int, float))
Contributor

Either fix or delete this, but don't leave the comment as dead code

Contributor Author

Deleted


assert len(args) == 2

# validation
Contributor

This comment should be clearer: `# Validate that the second argument is a numeric literal between 0 and 1`

Contributor Author

Done
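
The kind of check the reviewer describes can be illustrated with a standalone Python sketch. It works on plain Python values rather than SQLGlot literals, and the helper name `validate_quantile_literal` is hypothetical, so treat it as an approximation of the intent rather than the code in the PR.

```python
from numbers import Real


def validate_quantile_literal(value: object) -> float:
    """Return value as a float if it is a number in [0, 1], else raise."""
    # Strings such as "0.5" are rejected even though they look numeric;
    # bool is excluded because it is a subclass of int in Python.
    if isinstance(value, bool) or not isinstance(value, Real):
        raise ValueError(
            f"quantile must be a numeric literal, got {value!r}"
        )
    if not 0 <= value <= 1:
        raise ValueError(
            f"quantile must be between 0 and 1, got {value!r}"
        )
    return float(value)


print(validate_quantile_literal(0.9))
```

Checking the type before the range keeps the error message specific about which requirement was violated.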

Contributor

@knassre-bodo left a comment

A few comments remaining while you address the testing

Comment on lines 778 to 782
Rewrites a QUANTILE aggregation call into an equivalent expression using
window functions.
This is typically used for dialects that do not natively support the
PERCENTILE_DISC
aggregate function.
Contributor

The newlines here are a bit off. Also, wrap functions like QUANTILE or PERCENTILE_DISC in backticks (`) for tooltip formatting.

Contributor

Please address this before merging

Comment on lines 1 to 7
WITH _s0 AS (
  SELECT
    n_name,
    n_nationkey,
    n_regionkey
  FROM tpch.nation
  ORDER BY
Contributor

Don't forget to delete the test_3 and test_4 SQL files since they are no longer part of test_pydough_to_sql.py.

PyDoughPandasTest(
    quantile_function_test_1,
    "TPCH",
    # Answer
Contributor

You don't need to write `# Answer` here every time

@john-sanchez31 merged commit 85838f5 into main on Jul 3, 2025
5 checks passed
@john-sanchez31 deleted the John/quantile branch on July 3, 2025 18:00
Labels:
- documentation (Improvements or additions to documentation)
- effort - medium (mid-sized issue with average implementation time/difficulty)
- extensibility (Increasing situations in which PyDough works)

Projects: None yet

Linked issues that may be closed by this pull request: Add QUANTILE function to PyDough

2 participants