What is a Correlated Subquery?
A correlated subquery is a query that depends on another query. Another way to describe it is when a TSQL command is made up of more than one TSQL statement where the inner TSQL statement depends on the results of the outer TSQL statement. Because the inner query depends on the outer query the inner query is called a correlated subquery. When I say "depends on the outer query" I mean the inner query references a column or columns from the outer query. Because the inner query references columns from the outer query it also can't be execute as a stand-alone TSQL statement. The correlated subquery might be run many times. It is run once for every row returned from the outer query. To better understand what a correlated subquery is let me show you a couple of examples that use a correlated subquery in the WHERE and HAVING clause and how a correlated subquery can be used in an UPDATE statement.
Sample Database for Examples
Rather than build my own test data I'm going to use the AdventureWorks2012 database. If you want to run the examples in this article you can download this sample database from the following location:
Correlated Subquery in WHERE Clause
This first example will show you how a correlated subquery can be used within the WHERE constraint. Suppose I have a need to identify the customer that has the highest SubTotal amount per region, so I can reward them. To accomplish this customer identification I can run the following code:
USE AdventureWorks2012; GO SELECT SOH.CustomerID , SOH.SalesOrderID , TerritoryID , TotalDue FROM Sales.SalesOrderHeader AS SOH WHERE SOH.SalesOrderNumber = (SELECT TOP 1 SalesOrderNumber FROM Sales.SalesOrderHeader WHERE TerritoryID = SOH.TerritoryId ORDER BY TotalDue) ORDER BY TerritoryID;
When I run this code I get the following output:
CustomerID SalesOrderID TerritoryID TotalDue -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - 28094 53084 1 2.5305 30078 51782 2 1.5183 18315 60985 3 8.0444 11527 52371 4 2.5305 11533 57966 5 2.5305 11142 52682 6 2.5305 20142 52051 7 4.409 24561 52392 8 4.409 15640 51885 9 2.5305 29644 65214 10 3.0365
The part of the code that is the correlated subquery can be found on the right side of the WHERE constraint. Here is that subquery code:
SELECT TOP 1 SalesOrderNumber FROM Sales.SalesOrderHeader WHERE TerritoryID = SOH.TerritoryId ORDER BY TotalDue
You can tell this is a correlated subquery because the SOH.TerritoryID column is referencing the SalesOrderHeader table from the outer query. Additionally if you run this code as is it will fail, whereas a normal subquery would be able to be run independently of the complete query.
Correlated Subquery in HAVING Clause
In this example I want to return all the CustomerID's for those customers that order more product in 2008 than they did in 2007. To do this I will use a correlated subquery in the HAVING clause. Here is the query that will identify those CustomerID's that bought more product in 2008:
SELECT CustomerID FROM Sales.SalesOrderHeader SOH WHERE YEAR(OrderDate) = 2008 GROUP BY CustomerID HAVING SUM(SubTotal) > ( SELECT SUM(SubTotal) FROM Sales.SalesOrderHeader WHERE YEAR(OrderDate) = 2007 AND CustomerID = SOH.CustomerID GROUP BY CustomerID)
Here you can see there are two different SELECT statements that make up this TSQL statement. This first SELECT statement is known as the outer query. The second SELECT statement, which is part of the HAVING clause, is known as the inner query. In the second SELECT statement I refer to the column SOH.CustomerID, which is referring to the CustomerID returned from the outer query that placed orders in 2008. Because the second SELECT statement referred to a column returned from the first SELECT statement is why this second SELECT statement is a correlated subquery. This correlated subquery returns the sum of the SubTotal column for 2007 orders for each CustomerID that was identified in the outer query, or another way to say it the correlated subquery runs for every customer that placed an order in 2008. This summed up SubTotal value of the 2007 orders for a specific customer's 2007 orders is then compare to the summed up SubTotal for the 2008 orders for the same customer. If the 2008 summed up SubTotal is more than the 2007 summed up SubTotal amount then that CustomerID is returned from this query.
Performing an Update Using a Correlated Subquery
You can even use a correlated subquery in an UPDATE statement. To demo this I will be using the following script to CREATE and populate my UpdateDemo table:
USE TempDB; GO -- Create demo table CREATE TABLE UpdateDemo ( ID INT, ParentID INT, BirthDate DATE, Sex Char(1), NumOfChildren INT); -- Populated table with Parent Record INSERT INTO UpdateDemo VALUES(1,0,'10-18-1990','F',0); -- Insert child Record INSERT INTO UpdateDemo VALUES(2,1,'10-12-2013','M',0); -- Display Results SELECT * FROM UpdateDemo;
When I run this code my UpdateDemo table will be created and then populated with two records. Those two records look like this:
ID ParentID BirthDate Sex NumOfChildren -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - 1 0 1990-10-18 F 0 2 1 2013-10-12 M 0
Here you can see that I created two records in the UpdateDemo table. The design of this table is to track parents and the number of children they have. As you can see the NumOfChildren column is set to zero for both records in the UpdateDemo table. The zero value for the NumOfChildren on the row that has an ID value of 1 is not correctly populated because that parent record does have a child record. You can see this by looking at the ParentID column. Note that for the row with ID 2, the ParentID is set to 1. This is the child record for ID 1. In order to update the NumOfChildren column for ID 1 I will use the UPDATE statement in the following code, which contains a correlated subquery:
-- Update Table with Subquery UPDATE UpdateDemo SET NumOfChildren = (SELECT COUNT(*) FROM UpdateDemo Inner_UpdateDemo WHERE Inner_UpdateDemo.ParentID = Outer_UpdateDemo.ID) FROM UpdateDemo Outer_UpdateDemo; -- Display Results SELECT * FROM UpdateDemo;
When I run this code I get the following output returned from the SELECT statement:
ID ParentID BirthDate Sex NumOfChildren -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - 1 0 1990-10-18 F 1 2 1 2013-10-12 M 0
By looking at the output of the above correlated subquery you can see that the NumOfChildren column is now set to 1. This was accomplished by the correlated UPDATE statement. If you look at the SET clause in the UPDATE statement above you can see that I counted the number of rows in the UpdateDemo table that meet the following WHERE constraint:
WHERE Inner_UpdateDemo.ParentID = Outer_UpdateDemo.ID
In this WHERE constraint you can see that on the right side of the equal sign I referenced a column with an alias name of Outer_UpdateDemo.ID. This column reference is what makes this SELECT statement in the SET clause a correlated subquery, because this column is referencing the table alias from the outer query. Also if you try to run this SELECT statement by itself it will fail with the following error:
Msg 4104, Level 16, State 1, Line 22 The multi-part identifier "Outer_UpdateDemo.ID" could not be bound.
This is error a dead giveaway that this is a correlated subquery.
The following code tests out the update statement again to make sure it still works after I add more row to my UpdateDemo table by running the following code:
-- INSERT Another Child arecord INSERT INTO UpdateDemo VALUES(2,1,'7-1-2015','F',0); -- Update Table Again UPDATE UpdateDemo SET NumOfChildren = (SELECT COUNT(*) FROM UpdateDemo Inner_UpdateDemo WHERE Inner_UpdateDemo.ParentID = Outer_UpdateDemo.ID) FROM UpdateDemo Outer_UpdateDemo; -- Display Results SELECT * FROM UpdateDemo;
When I run this code I get the following output:
ID ParentID BirthDate Sex NumOfChildren -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - 1 0 1990-10-18 F 2 2 1 2013-10-12 M 0 3 1 2015-07-01 F 0
Now you can see the NumOfChildren column for the row with an ID value of 1 is equal to 2.
This example only showed using a correlated subquery in an UPDATE statement. You can also us correlated subqueries in the INSERT, and DELETE statements as well.
Performance Issues with Correlated Subqueries
A correlated subquery is executed for each candidate row from the outer query. Because of this the correlated subquery will be executed over, and over again. If an outer query returns a large result set the correlated subquery might not scale well. When a correlated subquery runs slow a different solution will be necessary, such as an INNER JOIN, or constraining your outer query to return fewer rows.
Conclusion
As you can see correlated subqueries are a great way to relate two different sets of records to identify the final set. By using a correlated subquery you can filter out rows in the outer query based on the results of the inner query. Next time you have a need to restrict rows in one result set by records that can be related to another set, then consider whether using a correlated subquery will get you the desired final set you are looking for.
No comments:
Post a Comment