How SQL Execution Orders Varies Across Databases
Why you can’t GROUP BY ordinal positions in SQL Server but can in others
After working regularly with open-source databases such as MySQL and PostgreSQL, I recently had the chance to work on a SQL Server project and discovered a subtle but important difference in the SQL landscapes. I observed that in SQL Server, I was unable to GROUP BY ordinal positions (GROUP BY 1, 2, 3…), which is a functionality I frequently used within other databases, particularly for rapid testing.
This discovery led me to explore several nuances of both database systems, particularly the SQL execution order, which will be the focus of this article.
Why does this matter? When working with database systems, understanding the subtle differences can greatly impact your workflow and increase your productivity. It can save you significant troubleshooting time. Also, by understanding the SQL execution order of various databases, you can craft more optimal SQL queries based on the system you are working with.
In this article, we will look into one major use case where this behaviour occurs — GROUP BY — and investigate why. However, this insight can be applied to the HAVING, WHERE or any other SQL command clause.
Let’s begin
Let’s look at this example in the query below. This will not work in SQL Server even though it works in MySQL:
SELECT
DATEPART(year, day) AS order_date,
SUM(cost) as cost
FROM clean
GROUP BY 1;
If you run this, you will probably get an error like this:
Each GROUP BY expression must contain at least one column that is not an outer reference.
However, this revised query works after replacing the GROUP BY ordinal reference with the explicit expression. You will also notice that you can reference the ordinal positions in the ORDER BY clause, which I found strange:
SELECT
datepart(year, day),
sum(cost) as cost
from clean
GROUP BY datepart(year, day)
ORDER BY 1;
In SQL Server, I quickly learned that I had to use explicit column names or expressions in the GROUP BY clause. This is considered a best practice as it makes the code easier to understand. However, I was curious about why this behavior differed between databases. Additionally, I found it interesting that the ORDER BY clause in SQL Server works with ordinal positions, which further piqued my curiosity.
Exploring the SELECT statement execution order
To find out, lets look at the SELECT statement execution/processing order for SQL server vs. other databases. It’s important to note that in SQL databases, each part of a query is executed sequentially and that order differs from how it is written.
In SQL Server, for instance, we can see from the image below and from Microsoft docs that the FROM clause is the first command to be evaluated. Furthermore, the SELECT clause runs after the GROUP BY clause. That is why we were unable to reference a column‘s position or even its alias in the GROUP BY clause in our first example!
However, we are free to reference the ordinal position and/or alias in the ORDER BY clause, as that is evaluated after the SELECT clause. The SELECT clause tells the database what columns will be returned and thus, the positioning is known at this point. Cool, right?
SQL Server execution order
MySQL
In MySQL however, I found it difficult to find clear documentation stating the order of execution of a SQL query. The execution order seems to depend on the contents of the query and what the query optimiser defines as the best path.
But from what we can see from the MySQL docs here, the clue shows us how the execution order might be and that the SELECT clause is evaluated before the GROUP BY clause:
For GROUP BY or HAVING clauses, it searches the FROM clause before searching in the select_expr values. (For GROUP BY and HAVING, this differs from the pre-MySQL 5.0 behavior that used the same rules as for ORDER BY.)
GoogleSQL
If we also look at the GoogleSQL (formerly Standard SQL) docs which is the syntax used in Google BigQuery, you will see similar deviation from the way queries are executed in SQL Server:
GROUP BY and ORDER BY can also refer to a third group: Integer literals, which refer to items in the SELECT list. The integer 1 refers to the first item in the SELECT list, 2 refers to the second item, etc.
As you can see, this behaviour is not supported in SQL Server. The Google docs also mention that GROUP BY, ORDER BY, and HAVING, can refer to aliases from the SELECT list.
With that, we can conclude with a high probability that the execution order for these other databases follows a path similar to the image below:
MySQL, PostgreSQL & BigQuery probable execution order
Conclusion
This was a short post where we looked at how the execution order in MySQL, GoogleSQL and other databases’ SQL syntaxes, differ from SQL Server’s, based on the observed behaviours and documentation. SQL Server emphasises explicitness in the GROUP BY clause for code clarity, while MySQL’s execution order definitely evaluates that SELECT clause before the GROUP BY clause, allowing us to reference the ordinal positions in it.
Feel free to share your thoughts on this subject and catch you in the next one.
You can become a Medium member to support me and enjoy more stories like this.
References
- SELECT — Transact-SQL
- ROWNUM — Oracle
- MySQL reference
- PostgreSQL – EXPLAIN
How SQL execution orders varies across databases was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
Originally appeared here:
How SQL execution orders varies across databases
Go Here to Read this Fast! How SQL execution orders varies across databases