Anatomy of Windows Functions

Theory and practice of an underappreciated SQL operation

Introduction

The IT area is known for its constant changes, with new tools, new frameworks, new cloud providers, and new LLMs being created every day. However, even in this busy world, some principles, paradigms, and tools seem to challenge the status quo of ‘nothing is forever’. And, in the data area, there is no example of this as imposing as the SQL language.

Since its creation back in the 80s, it passed the age of Data Warehouses, materialized itself in the Hadoop/Data-lake/Big Data as Hive, and is still alive today as one of the Spark APIs. The world changed a lot but SQL remained not only alive but very important and present.

But SQL is like chess, easy to understand the basic rules but hard to master! It is a language with many possibilities, many ways to solve the same problem, many functions and keywords, and, unfortunately, many underrated functionalities that, if better known, could help us a lot when building queries.

Because of this, in this post, I want to talk about one of the not-so-famous SQL features that I found extremely useful when building my daily queries: Window Functions.

What is a Window Function

The traditional and most famous SGBDs (PostgreSQL, MySQL, and Oracle) are based on relational algebra concepts. In it, the lines are called tuples, and, the tables, are relations. A relation is a set (in the mathematical sense) of tuples, i.e. there is no ordering or connection between them. Because of that, there is no default ordering of lines in a table, and the calculus performed on one line does not impact and it is not impacted by the results of another. Even clauses like ORDER BY, only order tables, and it is not possible to make calculus in a line based on the values of other lines.

Simply put, window functions fix this, extending the SQL functionalities, and allowing us to perform calculations in one row based on the values of other lines.

Basic cases for understanding/ The Anatomy

1-Aggregating Without Aggregation

The most trivial example to understand Windows functions is the ability to ‘aggregate without aggregation’.

When we made an aggregation with traditional GROUP BY, the whole table is condensed into a second table, where each line represents a group’s element. With Windows Functions, instead of condensing the lines, it’s possible to create a new column in the same table containing the aggregation results.

For example, if you need to add up all the expenses in your expense table, traditionally you would do:

SELECT SUM(value) AS total FROM myTable

With Windows functions, you would make something like that:

SELECT *, SUM(value) OVER() FROM myTable
-- Note that the window function is defined at column-level
-- in the query

The image below shows the results:

Image 1. Traditional Group By vs Windows Functions.

Rather than creating a new table, it will return the aggregation’s value in a new column. Note that the value is the same, but the table was not ‘summarized’, the original lines were maintained — we just calculated an aggregation without aggregating the table 😉

The OVER clause is the indication that we’re creating a window function. This clause defines over which lines the calculation will be made. It is empty in the code above, so it will calculate the SUM() over all the lines.

This is useful when we need to make calculations based on totals (or averages, minimums, maximums) of columns. For example, to calculate how much each expense contributes in percentage relative to the total.

In real cases, we might also want the detail by some category, like in the example in image 2, where we have company expenses by department. Again, we can achieve the total spent by each department with a simple GROUP BY:

SELECT depto, sum(value) FROM myTable GROUP BY depto

Or specify a PARTITION logic in the window function:

SELECT *, SUM(value) OVER(PARTITION BY depto) FROM myTable

See the result:

Image 2. Traditional Group By vs Windows Functions II.

This example helps to understand why the operation is called a ‘window’ function — the OVER clause defines a set of lines over which the corresponding function will operate, a ‘window’ in the table.

In the case above, the SUM() function will operate in the partitions created by the depto column (RH and SALES) — it will sum all the values in the ‘value’ column for each item in the depto column in isolation. The group the line is part of (RH or SALES) determines the value in the ‘Total’ column.

2 — Time and Ordering awareness

Sometimes we need to calculate the value of a column in a row based on the values of other rows. A classic example is the yearly growth in a country’s GDP, computed using the current and the previous value.

Computations of this kind, where we need the value of the past year, the difference between the current and the next rows, the first value of a series, and so on are a testament to the Windows function’s power. In fact, I don’t know if this behavior could be achieved with standard SQL commands! It probably could, but would be a very complex query…

But windows functions made it straightforward, see the image below (table recording some child’s height):

SELECT 
  year, height, 
  LAG(height) OVER (ORDER BY year) AS height_last_year
FROM myTable

The function LAG( ‘column’ ) is responsible for referencing the value of ‘column’ in the previous row. You can imagine it as a sequence of steps: In the second line, consider the value of the first; In the third, the value of the second; and so on… The first line doesn’t count (hence the NULL), as it has no predecessor.

Naturally, some ordering criterion is needed to define what the ‘previous line’ is. And that’s another important concept in Windows functions: analytical functions.

In contrast to traditional SQL functions, analytical functions (like LAG) consider that there exists an ordering in the lines — and this order is defined by the clause ORDER BY inside OVER(), i.e., the concept of first, second, third lines and so on is defined inside the OVER keyword. The main characteristic of these functions is the ability to reference other rows relative to the current row: LAG references the previous row, LEAD references the next rows, FIRST references the first row in the partition, and so on.

One nice thing about LAG and LEAD is that both accept a second argument, the offset, which specifies how many rows forward (for LEAD) or backward (for LAG) to look.

SELECT 
    LAG(height,  2) OVER (ORDER BY year) as height_two_years_ago,
    LAG(height,  3) OVER (ORDER BY year) as height_three_years_ago,
    LEAD(height)    OVER (ORDER BY year) as height_next_year
FROM ...

And it is also perfectly possible to perform calculations with these functions:

SELECT 
    100*height/(LAG(height) OVER (ORDER BY year)) 
    AS "annual_growth_%"
FROM ...

3 — Time Awareness and Aggregation

Time and space are only one — once said Einsteinm, or something like that, I don’t know ¯_(ツ)_/¯

Now that we know how to partition and order, we can use these two together! Going back to the previous example, let’s suppose there are more kids on that table and we need to compute the growth rate of each one. It’s very simple, just combine ordering and partitioning! Let’s order by year and partition by child name.

SELECT 1-height/LAG(height) OVER (ORDER BY year PARTITION BY name) ...

The above query does the following — Partitions the table by child and, in each partition, orders the values by year and divides the current year height value with the previous value (and subtracts the result from one).

We’re getting closer to the full concept of ‘window’! It’s a table slice, a set of rows grouped by the columns defined in PARTITION BY that are ordered by the fields in ORDER BY, where all the computations are made considering only the rows in the same group (partition) and a specific ordering.

4-Ranking and Position

Windows functions can be divided into three categories, two of which we already talked about: Aggregation functions ( COUNT, SUM, AVG, MAX, … ) and Analytical Functions ( LAG, LEAD, FIRST_VALUE, LAST_VALUE, … ).

The third group is the simplest — Ranking Functions, with its greatest exponent being the row_number() function, which returns an integer representing the position of a row in the group (based on the defined order).

SELECT row_number() OVER(ORDER BY score)

Ranking functions, as the name indicates, return values based on the position of the line in the group, defined by the ordering criteria. ROW_NUMBER, RANK, and NTILE are some of the most used.

In the image above, a row number is created based on each player’s score

… and yes, it commits the atrocious programming sin of starting from 1.

5-Window size

All the functions presented till this point consider ALL the rows in the partition/group when computing the results. For example, the SUM() described in the first example considers all department’s rows to compute the total.

But it is possible to specify a smaller window size, i.e. how many lines before and after the current line should be considered in the computations. This is a helpful functionality to calculate moving averages / rolling windows.

Let’s consider the following example, with a table containing the daily number of cases of a certain disease, where we need to compute the average number of cases considering the current day and the two previous. Note that it’s possible to solve this problem with the LAG function, shown earlier:

SELECT
( n_cases + LAG(n_cases, 1) + LAG(n_cases, 2) )/3
OVER (ORDER BY date_reference)

But we can achieve the same result more elegantly using the concept of frames:

SELECT
AVG(n_cases)
OVER (
	ORDER BY date_reference
	ROWS BETWEEN 2 PRECEDING AND CURRENT ROW	
)

The frame above specifies that we must calculate the average looking only for the two previous (PRECEDING) rows and the current row. If we desire to consider the previous, the current line, and the following line, we can change the frame:

AVG(n_cases)
OVER (
	ORDER BY date_reference
	ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
)

And that’s all a frame is — a way to limit a function’s reach to a specific bound. By default (in most cases), windows functions consider the following frame:

ROWS BETWEEN UNBOUDED PRECEDING AND CURRENT ROW
-- ALL THE PREVIOUS ROWS + THE CURRENT ROW

Image 6. Exploring Window Size definition

I hope this introduction helps you better understand what Windows functions are, how they work, and their syntax in practice. Naturally, many more keywords can be added to Windows functions, but I think this introduction already covers many commands you’ll likely use in everyday life. Now, let’s see some interesting practical applications that I use in my daily routine to solve problems — some are very curious!

Curious and Interesting use cases for windows functions

Cumulative Sum over time

This is one of the most classic cases of using windows functions.

Imagine a table with your salary per month and you want to know how much you earned in each month cumulatively (considering all previous months), this is how it works:

Image 7. Practical example — Cumulative SUM

Pretty simple, right?

An interesting thing to note in this query is that the SUM() function considers the current row and all previous rows to calculate the aggregation, as mentioned previously.

Duration of events in log tables

I recently used this one in my post My First Billion (of Rows) in DuckDB, in which I manipulate logs from electronic voting machines in Brazil, it’s worth checking if you’re interested in the processing of large volumes of data.

In summary, imagine a log table in which each event is composed of a timestamp that indicates when it started, its name, and a unique identifier. Considering that each event only starts when the previous one ends, we can easily add a column with the event duration as follows:

Image 8. Practical example — Event Duration in Logs

Fill missing values (With the last occurrence)

Machine learning classic with pandas! Just do a fillna, bfill or whatever, and that’s it, we fill in the null values with the last valid occurrence.

How to do this in SQL? Simple!

Image 8. Practical example — Filling Missing Values I

When we first study machine learning, we work a lot with pandas and get used to their high-level functions. However, when working on a real project, the data volume can be very large, so we may not be lucky enough to use pandas and need to switch to tools such as PySpark, Snowflake, Hive+hadoop, etc — all of which, in one way or another, can be operated in SQL. Therefore, I think it is important to learn how to do these treatments and preprocessing in SQL.

Filling missing values (With the avg of preceding rows)

A slightly more elaborate way of filling in null values, but still simple!

Image 8. Practical example — Filling Missing Values II

This example highlights that, despite seeming complicated and special, windows functions can be used just like normal columns! They can be included in CASE, calculations can be done with them and so on. One of the few restrictions I know of is that they cannot be placed directly in a WHERE clause:

SELECT * FROM
WHERE SUM() OVER() > 10 -- This is not possible in postgres

Row Deduplication based on a set of columns

Another classic of windows functions! Sometimes we need to deduplicate rows in a table based on just one set of columns.

Of course, in SQL we have the DISTINCT clause, but it only works if the full line is duplicated. If a table has several lines with the same value in an ID column but with different values in the remaining columns, it’s possible to deduplicate with the following logic:

Image 9. Practical example — Deduplication

SELECT *
  FROM (
   SELECT
   ROW_NUMBER() OVER (PARTITION BY id) as row_number
  )
WHERE row_number = 1

This operation also allows data versioning! For example, if we save a new line for each time a user changed their name in the system with the date of change (instead of changing the existing line), we can retrieve each user’s current name:

SELECT
    *
FROM
(
  SELECT 
    name, 
    row_number() OVER (PARTITION BY id ORDER BY DATE DESC) AS row_number
  FROM myTable
) AS subquery
WHERE row_number = 1

Percentage of occurrences of a group/class over the total number of rows

Consider a table that lists various pets, which can be dogs, cats, or birds. We need to add a column to each row indicating the percentage that each pet type represents out of the total count of all pets. This task is solved using not one, but two window functions!

Image 10. Practical example — Percentage of Occurences

In the image above, to make it more educational, I added two columns to represent the result of each window function, but only the rightmost column is actually created.

And you, do you have any interesting windows functions cases that you would like to share? Please leave it in the comments!

Conclusion

I wouldn’t dare say that SQL is vintage or classic, as these, although positive, refer to the past. For me, SQL is present, pervasive, and, without a doubt, an essential language for anyone working in the data area.

However, several problems may seem complicated to solve using just SQL itself and, at these times, having a good knowledge of the language and its capabilities is really important. Without windows functions, many problems considered common — when looking from a Pythonic perspective — would be very difficult or even impossible to solve. But we can do magic if we know how to use the tools correctly!

I hope this post has helped you better understand how Windows functions work and what types of problems they can solve in practice. All the material presented here was mainly based on PostgreSQL syntax and may not necessarily work right away in another database, but the most important thing is the logic itself. As always, I’m not an expert on the subject and I strongly recommend deeper reading — and lots of practice — to anyone interested in the subject.

Thank you for reading! 😉

References

All the code is available in this GitHub repository.
Interested in more works like this one? Visit my posts repository.

[1] Data processing with PostgreSQL window functions. (n.d.). Timescale. Link.
[2] Kho, J. (2022, June 5). An easy guide to advanced SQL window functions — towards data science. Medium.
[3] Markingmyname. (2023, November 16). Funções analíticas (Transact-SQL) — SQL Server. Microsoft Learn.
[4] PostgreSQL Tutorial. (2021, April 27). PostgreSQL Window Functions: The Ultimate Guide. Link.
[5] VanMSFT. (2023, May 23). OVER Clause (Transact-SQL) — SQL Server. Microsoft Learn.
[6] Window Functions. (n.d.). SQLite Official docs.
[7] Window Functions. (2014, July 24). PostgreSQL Documentation.

All images in this post are made by the author.

Anatomy of Windows Functions was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

Originally appeared here:
Anatomy of Windows Functions

Go Here to Read this Fast! Anatomy of Windows Functions

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.