Window functions in SQL

Mastering SQL Window Functions: A Comprehensive Guide

Window functions in SQL are powerful tools for performing complex calculations across a set of rows related to the current row. Unlike typical aggregate functions that group rows, window functions allow each row to retain its original details. This makes them ideal for analytics and ranking tasks.

In this guide, we’ll explore some of the most commonly used window functions and how to use them effectively.

1. ROW_NUMBER() - Assigning Sequential Numbers

The ROW_NUMBER() function assigns a unique, sequential integer to rows within a partition (group) of the result set.

Use Case: Useful for adding unique IDs or ranking rows without ties.


SELECT id, name, value,
       ROW_NUMBER() OVER (PARTITION BY name ORDER BY value DESC) AS row_num
FROM my_table;

2. RANK() - Ranking with Ties Allowed

RANK() assigns a rank to each row within a partition. Rows with the same values get the same rank, but gaps are left in the ranking.

Use Case: Ideal when ranking with ties.


SELECT id, name, value,
       RANK() OVER (PARTITION BY name ORDER BY value DESC) AS rank
FROM my_table;

3. DENSE_RANK() - Ranking Without Gaps

Similar to RANK(), DENSE_RANK() allows ties but without leaving gaps in the ranks.

Use Case: Ranking without gaps.


SELECT id, name, value,
       DENSE_RANK() OVER (PARTITION BY name ORDER BY value DESC) AS dense_rank
FROM my_table;

4. NTILE(n) - Dividing Rows into Buckets

NTILE(n) divides the result set into n equal parts, assigning a bucket number to each row.

Use Case: Dividing data into quantiles (e.g., quartiles or deciles).


SELECT id, name, value,
       NTILE(4) OVER (ORDER BY value) AS quartile
FROM my_table;

5. LEAD() - Accessing the Next Row

The LEAD() function accesses data from the next row in the partition, returning NULL for the last row.

Use Case: Comparing a row with the next row’s values.


SELECT id, name, value,
       LEAD(value, 1) OVER (PARTITION BY name ORDER BY value) AS next_value
FROM my_table;

6. LAG() - Accessing the Previous Row

LAG() allows access to data from the previous row within the partition, returning NULL for the first row.

Use Case: Comparing a row with the previous row’s values.


SELECT id, name, value,
       LAG(value, 1) OVER (PARTITION BY name ORDER BY value) AS prev_value
FROM my_table;

7. FIRST_VALUE() - Fetching the First Row Value

The FIRST_VALUE() function returns the first value in the window or partition.

Use Case: Extracting the initial value for comparison across the partition.


SELECT id, name, value,
       FIRST_VALUE(value) OVER (PARTITION BY name ORDER BY value) AS first_value
FROM my_table;

8. LAST_VALUE() - Fetching the Last Row Value

Similar to FIRST_VALUE(), LAST_VALUE() returns the last value in the window or partition.

Use Case: Extracting the latest value within a partition.


SELECT id, name, value,
       LAST_VALUE(value) OVER (PARTITION BY name ORDER BY value) AS last_value
FROM my_table;

9. CUME_DIST() - Calculating Cumulative Distribution

CUME_DIST() calculates the cumulative distribution of a value in the partition, assigning a relative rank between 0 and 1.

Use Case: Calculating the percentile rank of rows.


SELECT id, name, value,
       CUME_DIST() OVER (PARTITION BY name ORDER BY value) AS cume_dist
FROM my_table;

10. PERCENT_RANK() - Calculating Percentile Rank

PERCENT_RANK() calculates the percentile rank for each row in the partition, providing values between 0 and 1 without ties.

Use Case: Finding the percentile rank.


SELECT id, name, value,
       PERCENT_RANK() OVER (PARTITION BY name ORDER BY value) AS percent_rank
FROM my_table;

11. Aggregate Functions SUM(), AVG(), COUNT() with Window

You can also use aggregate functions like SUM(), AVG(), and COUNT() with a window to calculate running totals, averages, and counts within each partition.

Use Case: Calculating running totals or moving averages.


SELECT id, name, value,
       SUM(value) OVER (PARTITION BY name ORDER BY value ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM my_table;

Summary

These SQL window functions provide incredible flexibility for analytical queries without losing row details. By combining them with PARTITION BY and ORDER BY clauses, you can perform complex calculations across subsets of data, making it easy to derive meaningful insights.

Try these examples to enhance your SQL skills and unlock the power of window functions for your data analysis tasks!

Comments

Popular posts from this blog

Databricks Auto Loader

Optimizing Spark memory - Resolving OutOfMemory (OOM)

PySpark tips for day-to-day use