Sunday, February 19, 2012

Could anyone test this on SQL2005 AdventureWorks

SELECT TerritoryId,
[2003-01],[2003-02],[2003-03],[2003-04],[2003-05],[2003-06],
[2003-07],[2003-08],[2003-09],[2003-10],[2003-11],[2003-12],
[2004-01],[2004-02],[2004-03],[2004-04],[2004-05],[2004-06],
[2004-07]
from ( select TerritoryId,
CONVERT(CHAR(7), h.OrderDate, 120) AS theDate, d.LineTotal
FROM Sales.SalesOrderHeader h JOIN Sales.SalesOrderDetail d ON h.SalesOrderID = d.SalesOrderID
) p
pivot (
sum(LineTotal) for theDate in ([2003-01],[2003-02],[2003-03],[2003-04],[2003-05],[2003-06],
[2003-07],[2003-08],[2003-09],[2003-10],[2003-11],[2003-12],
[2004-01],[2004-02],[2004-03],[2004-04],[2004-05],[2004-06],[2004-07])
) pvt
order by TerritoryId

The code is based on a sample from Pro SQL2005 (Apress). The book is great BTW.

While modifying it for a testdrive I think I made an error but where. Inner select executes in less than a second. However applying the pivot it takes 5-6 seconds.

Total row count is around 121K. All it does with PIVOT is to sum and crosstab. If I do this taking raw data from inner select outside and do pivoting myself it's much faster. So is it my code here or PIVOT is really an unoptimized addition?

TIA

Some more statistics on this process. If execute this pivot and then just inner select (after a few times) here is client statistics from SSMS:

*********************************************************************

Client Execution Time 18:32:27 18:31:33

Query Profile Statistics

Number of INSERT, DELETE and UPDATE statements 0 0 0.0000

Rows affected by INSERT, DELETE, or UPDATE statements 0 0 0.0000

Number of SELECT statements 2 2 2.0000

Rows returned by SELECT statements 121318 11 60664.5000

Number of transactions 0 0 0.0000

Network Statistics

Number of server roundtrips 3 3 3.0000

TDS packets sent from client 3 3 3.0000

TDS packets received from server 635 41 338.0000

Bytes sent from client 538 1528 1033.0000

Bytes received from server 2591244 157727 1374486.0000

Time Statistics

Client processing time 375 0 187.5000

Total execution time 390 5187 2788.5000

Wait time on server replies 15 5187 2601.0000

*********************************************************************

To summarize it, SQL server selects the raw data in 390 milliseconds then spends considerable time on pivoting for a total time of 5187 milliseconds.

Here is the timings if I instead just ask for the inner select data (121317 rows) via ODBC, sum and pivot myself externally, again creating the same result table:

For same period range:

Got data-elapsed:457

XTabbed-total elapsed:729

For all 37 months:

Got data-elapsed:455

XTabbed-total elapsed:741

Timings are in milliseconds.

I believe I'm using PIVOT syntax wrong.

|||

I don't see that much performance difference between PIVOT and custom query. The PIVOT operator translates to pretty much same expressions. There is little bit additional cost but not significant. And the IO is identical. I tried this on SQL Server 2005 build# 2153 (with cumulative hotfix for SP1).

use AdventureWorks

go

set statistics io on

set statistics time on

go

SELECT TerritoryID,
[2003-01],[2003-02],[2003-03],[2003-04],[2003-05],[2003-06],
[2003-07],[2003-08],[2003-09],[2003-10],[2003-11],[2003-12],
[2004-01],[2004-02],[2004-03],[2004-04],[2004-05],[2004-06],
[2004-07]
from ( select TerritoryID,
CONVERT(CHAR(7), h.OrderDate, 120) AS theDate, d.LineTotal
FROM Sales.SalesOrderHeader h JOIN Sales.SalesOrderDetail d ON h.SalesOrderID = d.SalesOrderID
) p
pivot (
sum(LineTotal) for theDate in ([2003-01],[2003-02],[2003-03],[2003-04],[2003-05],[2003-06],
[2003-07],[2003-08],[2003-09],[2003-10],[2003-11],[2003-12],
[2004-01],[2004-02],[2004-03],[2004-04],[2004-05],[2004-06],[2004-07])
) pvt
order by TerritoryID;


SELECT TerritoryID
,sum(case theDate when '2003-01' then LineTotal end) as [2003-01]
,sum(case theDate when '2003-02' then LineTotal end) as [2003-02]
,sum(case theDate when '2003-03' then LineTotal end) as [2003-03]
,sum(case theDate when '2003-04' then LineTotal end) as [2003-04]
,sum(case theDate when '2003-05' then LineTotal end) as [2003-05]
,sum(case theDate when '2003-06' then LineTotal end) as [2003-06]
,sum(case theDate when '2003-07' then LineTotal end) as [2003-07]
,sum(case theDate when '2003-08' then LineTotal end) as [2003-08]
,sum(case theDate when '2003-09' then LineTotal end) as [2003-09]
,sum(case theDate when '2003-10' then LineTotal end) as [2003-10]
,sum(case theDate when '2003-11' then LineTotal end) as [2003-11]
,sum(case theDate when '2003-12' then LineTotal end) as [2003-12]
,sum(case theDate when '2004-01' then LineTotal end) as [2004-01]
,sum(case theDate when '2004-02' then LineTotal end) as [2004-02]
,sum(case theDate when '2004-03' then LineTotal end) as [2004-03]
,sum(case theDate when '2004-04' then LineTotal end) as [2004-04]
,sum(case theDate when '2004-05' then LineTotal end) as [2004-05]
,sum(case theDate when '2004-06' then LineTotal end) as [2004-06]
,sum(case theDate when '2004-07' then LineTotal end) as [2004-07]
from ( select TerritoryID,
CONVERT(CHAR(7), h.OrderDate, 120) AS theDate, d.LineTotal
FROM Sales.SalesOrderHeader h JOIN Sales.SalesOrderDetail d ON h.SalesOrderID = d.SalesOrderID
) p
group by TerritoryID
order by TerritoryID;

go

set statistics io off

set statistics time off

go

|||

Thanks for trying. I thought my PIVOT syntax was wrong.

The point is that I really didn't expect PIVOT to be only a syntax enhancement over all those many sum(case ....) coding but as "how it should be done" logic as well. In other words I expected it would create a more clever algorithm behind the scenes and return the results fast. I'm not good at T-SQL but even with my limited knowledge check this one:

use AdventureWorks
go
with myPivot as
(
select territoryID, theDate, sum(LineTotal) as LineTotal
from ( select TerritoryID,
CONVERT(CHAR(7), h.OrderDate, 120) AS theDate, d.LineTotal
FROM Sales.SalesOrderHeader h JOIN Sales.SalesOrderDetail d ON h.SalesOrderID = d.SalesOrderID
) p
group by TerritoryID, theDate
)
SELECT TerritoryID
,sum(case theDate when '2003-01' then LineTotal end) as [2003-01]
,sum(case theDate when '2003-02' then LineTotal end) as [2003-02]
,sum(case theDate when '2003-03' then LineTotal end) as [2003-03]
,sum(case theDate when '2003-04' then LineTotal end) as [2003-04]
,sum(case theDate when '2003-05' then LineTotal end) as [2003-05]
,sum(case theDate when '2003-06' then LineTotal end) as [2003-06]
,sum(case theDate when '2003-07' then LineTotal end) as [2003-07]
,sum(case theDate when '2003-08' then LineTotal end) as [2003-08]
,sum(case theDate when '2003-09' then LineTotal end) as [2003-09]
,sum(case theDate when '2003-10' then LineTotal end) as [2003-10]
,sum(case theDate when '2003-11' then LineTotal end) as [2003-11]
,sum(case theDate when '2003-12' then LineTotal end) as [2003-12]
,sum(case theDate when '2004-01' then LineTotal end) as [2004-01]
,sum(case theDate when '2004-02' then LineTotal end) as [2004-02]
,sum(case theDate when '2004-03' then LineTotal end) as [2004-03]
,sum(case theDate when '2004-04' then LineTotal end) as [2004-04]
,sum(case theDate when '2004-05' then LineTotal end) as [2004-05]
,sum(case theDate when '2004-06' then LineTotal end) as [2004-06]
,sum(case theDate when '2004-07' then LineTotal end) as [2004-07]
from myPivot
group by TerritoryID
order by TerritoryID
go

Or this one (that I based on how you wrote the comparison T-SQL):

with myPivot as
( select territoryID, theDate, sum(LineTotal) as LineTotal
from (
select TerritoryID,
CONVERT(CHAR(7), h.OrderDate, 120) AS theDate, d.LineTotal
FROM Sales.SalesOrderHeader h JOIN Sales.SalesOrderDetail d ON h.SalesOrderID = d.SalesOrderID
) p
group by TerritoryID, theDate
)
SELECT TerritoryID,
[2003-01],[2003-02],[2003-03],[2003-04],[2003-05],[2003-06],
[2003-07],[2003-08],[2003-09],[2003-10],[2003-11],[2003-12],
[2004-01],[2004-02],[2004-03],[2004-04],[2004-05],[2004-06],
[2004-07]
from myPivot
pivot (
sum(LineTotal) for theDate in ([2003-01],[2003-02],[2003-03],[2003-04],[2003-05],[2003-06],
[2003-07],[2003-08],[2003-09],[2003-10],[2003-11],[2003-12],
[2004-01],[2004-02],[2004-03],[2004-04],[2004-05],[2004-06],[2004-07])
) pvt
order by TerritoryID;

They both now have closer timing to my "get the raw data via ODBC and pivot yourself" approach (those 2 execute in 511 and 500 milliseconds vs to my 470 milliseconds which I accept to be the same). That leaded me to think why I need to specify the aggregate in PIVOT if I have to do aggregation myself outside for it to perform faster. Actually I think it would be way cooler if PIVOT didn't need an aggregate function but simply made a row to columns transposion.

However I checked my version and it's built 1399. Maybe it was corrected in 2153. I'll update and retest.

|||

I applied SP1 and now my version reads 2047 (not 2153).

Microsoft SQL Server Management Studio 9.00.2047.00
Microsoft Analysis Services Client Tools 2005.090.2047.00
Microsoft Data Access Components (MDAC) 2000.085.1117.00 (xpsp_sp2_rtm.040803-2158)
Microsoft MSXML 2.6 3.0 4.0 5.0 6.0
Microsoft Internet Explorer 6.0.2900.2180
Microsoft .NET Framework 2.0.50727.42
Operating System 5.1.2600

No improvement with this one. Probably pivot was only meant to be a somewhat less typing syntax. OK I was concerned if my syntax was the culprit.

|||Yes, currently PIVOT only provides syntax level simplification. It does pretty much the same as hand coded SQL statement. However, this may change in a future version of SQL Server. So if you have a situation where you can use PIVOT then use it. Currently, you cannot pivot on multiple expressions so for that you can go the traditional SQL route.|||Thank you very much. Knowing how it works is sufficient:)|||

BTW somewhere I saw you were collecting enhancement feature ideas. Even if the PIVOT stays same I'd like a version that looks like:

-- rowColumnList is typically group by columns (like TerritoryID) except the one that would be new columns (like theDate)

PIVOT ( (rowColumnList) (valueColumn) for filterColumn in (filter_columnNameList) )

and/or:

-- Same as above but doesn't need a columnnames list ( all unique values of theDate would be new columns)

PIVOT ( (rowColumnList) (valueColumn) on (columnNamesColumn) )

with no aggregation function in pivot_clause. PIVOT then merely do a row to column transposition only. Aggregation would be in select that pivot uses if need be (so users would not need tricks like max() on character/datetime data when there is really nothing to aggregate).

Thanks.

No comments:

Post a Comment