Executing Aggregate Functions on Multimillion Row Tables

I am having severe performance issues with a multi-million row table from which I feel I should be able to get results quickly. Here is a rundown of what I have, how I am querying it, and how long it takes:

  • I am running SQL Server 2008 Standard, so Partitioning is not an option at this time

  • I am trying to aggregate all views for all inventory for a specific account over the last 30 days.

  • All views are stored in the following table:

CREATE TABLE [dbo].[LogInvSearches_Daily] (
    [ID] [bigint] IDENTITY(1,1) NOT NULL,
    [Inv_ID] [int] NOT NULL,
    [Site_ID] [int] NOT NULL,
    [LogCount] [int] NOT NULL,
    [LogDay] [smalldatetime] NOT NULL,
 CONSTRAINT [PK_LogInvSearches_Daily] PRIMARY KEY CLUSTERED 
(
    [ID] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PRIMARY]
) ON [PRIMARY]
  • This table contains 132 million records and is over 4 GB in size.

  • A sample of 10 rows from the table:

ID                   Inv_ID      Site_ID     LogCount    LogDay
-------------------- ----------- ----------- ----------- -----------------------
1                    486752      48          14          2009-07-21 00:00:00
2                    119314      51          16          2009-07-21 00:00:00
3                    313678      48          25          2009-07-21 00:00:00
4                    298863      0           1           2009-07-21 00:00:00
5                    119996      0           2           2009-07-21 00:00:00
6                    463777      534         7           2009-07-21 00:00:00
7                    339976      503         2           2009-07-21 00:00:00
8                    333501      570         4           2009-07-21 00:00:00
9                    453955      0           12          2009-07-21 00:00:00
10                   443291      0           4           2009-07-21 00:00:00

(10 row(s) affected)
  • I have the following index on LogInvSearches_Daily:
/****** Object: Index [IX_LogInvSearches_Daily_LogDay] Script Date: 05/12/2010 11:08:22 ******/
CREATE NONCLUSTERED INDEX [IX_LogInvSearches_Daily_LogDay] ON [dbo].[LogInvSearches_Daily] 
(
    [LogDay] ASC
)
INCLUDE ([Inv_ID],
[LogCount]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
  • I only need to pull inventory from the Inventory table for a specific account ID. I have an index on Inventory as well.

I am using the following query to aggregate the data and give me the top 5 records. This query currently takes 24 seconds to return 5 rows:

StmtText
-------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- ------
SELECT TOP 5
    Sum(LogCount) AS Views
    , DENSE_RANK() OVER (ORDER BY Sum(LogCount) DESC, Inv_ID DESC) AS Rank
    , Inv_ID
FROM LogInvSearches_Daily D (NOLOCK)
WHERE 
    LogDay > DateAdd(d, -30, getdate())
    AND EXISTS (
        SELECT NULL FROM propertyControlCenter.dbo.Inventory (NOLOCK) WHERE Acct_ID = 18731 AND Inv_ID = D.Inv_ID
    )
GROUP BY Inv_ID


(1 row(s) affected)

StmtText
-------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- ------
  |--Top(TOP EXPRESSION:((5)))
       |--Sequence Project(DEFINE:([Expr1007]=dense_rank))
            |--Segment
                 |--Segment
                      |--Sort(ORDER BY:([Expr1006] DESC, [D].[Inv_ID] DESC))
                           |--Stream Aggregate(GROUP BY:([D].[Inv_ID]) DEFINE:([Expr1006]=SUM([LOALogs].[dbo].[LogInvSearches_Daily].[LogCount] as [D].[LogCount])))
                                |--Sort(ORDER BY:([D].[Inv_ID] ASC))
                                     |--Nested Loops(Inner Join, OUTER REFERENCES:([D].[Inv_ID]))
                                          |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1011], [Expr1012], [Expr1010]))
                                          |    |--Compute Scalar(DEFINE:(([Expr1011],[Expr1012],[Expr1010])=GetRangeWithMismatchedTypes(dateadd(day,(-30),getdate()),NULL,(6))))
                                          |    |    |--Constant Scan
                                          |    |--Index Seek(OBJECT:([LOALogs].[dbo].[LogInvSearches_Daily].[IX_LogInvSearches_Daily_LogDay] AS [D]), SEEK:([D].[LogDay] > [Expr1011] AND [D].[LogDay] < [Expr1012]) ORDERED FORWARD)
                                          |--Index Seek(OBJECT:([propertyControlCenter].[dbo].[Inventory].[IX_Inventory_Acct_ID]), SEEK:([propertyControlCenter].[dbo].[Inventory].[Acct_ID]=(18731) AND [propertyControlCenter].[dbo].[Inventory].[Inv_ID]=[LOA

(13 row(s) affected)

I tried using a CTE to pick up the rows first and then aggregate them, but that ran no faster and gives me essentially the same execution plan.

(1 row(s) affected)
StmtText
-------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- ------
--SET SHOWPLAN_TEXT ON;
WITH getSearches AS (
        SELECT
            LogCount
--          , DENSE_RANK() OVER (ORDER BY Sum(LogCount) DESC, Inv_ID DESC) AS Rank
            , D.Inv_ID
        FROM LogInvSearches_Daily D (NOLOCK)
            INNER JOIN propertyControlCenter.dbo.Inventory I (NOLOCK) ON Acct_ID = 18731 AND I.Inv_ID = D.Inv_ID
        WHERE 
            LogDay > DateAdd(d, -30, getdate())
--      GROUP BY Inv_ID
)

SELECT Sum(LogCount) AS Views, Inv_ID
FROM getSearches
GROUP BY Inv_ID


(1 row(s) affected)

StmtText
-------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- ----------------------------------------
  |--Stream Aggregate(GROUP BY:([D].[Inv_ID]) DEFINE:([Expr1004]=SUM([LOALogs].[dbo].[LogInvSearches_Daily].[LogCount] as [D].[LogCount])))
       |--Sort(ORDER BY:([D].[Inv_ID] ASC))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([D].[Inv_ID]))
                 |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1008], [Expr1009], [Expr1007]))
                 |    |--Compute Scalar(DEFINE:(([Expr1008],[Expr1009],[Expr1007])=GetRangeWithMismatchedTypes(dateadd(day,(-30),getdate()),NULL,(6))))
                 |    |    |--Constant Scan
                 |    |--Index Seek(OBJECT:([LOALogs].[dbo].[LogInvSearches_Daily].[IX_LogInvSearches_Daily_LogDay] AS [D]), SEEK:([D].[LogDay] > [Expr1008] AND [D].[LogDay] < [Expr1009]) ORDERED FORWARD)
                 |--Index Seek(OBJECT:([propertyControlCenter].[dbo].[Inventory].[IX_Inventory_Acct_ID] AS [I]), SEEK:([I].[Acct_ID]=(18731) AND [I].[Inv_ID]=[LOALogs].[dbo].[LogInvSearches_Daily].[Inv_ID] as [D].[Inv_ID]) ORDERED FORWARD)

(8 row(s) affected)


(1 row(s) affected)

So, given that I am getting good index seeks in my execution plan, what can I do to get this running faster?

UPDATE:

Here is the same query run without DENSE_RANK(). It takes the same 24 seconds to run and gives me the same basic query plan:

StmtText
-------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- ------
--SET SHOWPLAN_TEXT ON
SELECT TOP 5
    Sum(LogCount) AS Views
    , Inv_ID
FROM LogInvSearches_Daily D (NOLOCK)
WHERE 
    LogDay > DateAdd(d, -30, getdate())
    AND EXISTS (
        SELECT NULL FROM propertyControlCenter.dbo.Inventory (NOLOCK) WHERE Acct_ID = 18731 AND Inv_ID = D.Inv_ID
    )
GROUP BY Inv_ID
ORDER BY Views, Inv_ID

(1 row(s) affected)

StmtText
-------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- ------
  |--Sort(TOP 5, ORDER BY:([Expr1006] ASC, [D].[Inv_ID] ASC))
       |--Stream Aggregate(GROUP BY:([D].[Inv_ID]) DEFINE:([Expr1006]=SUM([LOALogs].[dbo].[LogInvSearches_Daily].[LogCount] as [D].[LogCount])))
            |--Sort(ORDER BY:([D].[Inv_ID] ASC))
                 |--Nested Loops(Inner Join, OUTER REFERENCES:([D].[Inv_ID]))
                      |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1010], [Expr1011], [Expr1009]))
                      |    |--Compute Scalar(DEFINE:(([Expr1010],[Expr1011],[Expr1009])=GetRangeWithMismatchedTypes(dateadd(day,(-30),getdate()),NULL,(6))))
                      |    |    |--Constant Scan
                      |    |--Index Seek(OBJECT:([LOALogs].[dbo].[LogInvSearches_Daily].[IX_LogInvSearches_Daily_LogDay] AS [D]), SEEK:([D].[LogDay] > [Expr1010] AND [D].[LogDay] < [Expr1011]) ORDERED FORWARD)
                      |--Index Seek(OBJECT:([propertyControlCenter].[dbo].[Inventory].[IX_Inventory_Acct_ID]), SEEK:([propertyControlCenter].[dbo].[Inventory].[Acct_ID]=(18731) AND [propertyControlCenter].[dbo].[Inventory].[Inv_ID]=[LOALogs].[dbo].[LogInvS

(9 row(s) affected)


Thanks,

Dan



3 answers


I haven't read through your whole question yet (I will shortly), but to address one of your early points: you can use partitioned views in SQL Server 2008 Standard Edition. It is partitioned tables (which are admittedly more flexible) that are limited to Enterprise Edition.

Partitioned view info: http://msdn.microsoft.com/en-us/library/ms190019.aspx
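
To make the partitioned view idea a bit more concrete, here is a minimal sketch. The yearly split, the member table names, and the view name are all hypothetical (nothing in the question prescribes them), and the primary key on (LogDay, ID) is only one possible choice. The CHECK constraints on LogDay are what can let the optimizer skip member tables that fall outside the requested date range.

```sql
-- Hypothetical member tables, one per year. The CHECK constraint on LogDay
-- tells the optimizer which date range each table can contain.
CREATE TABLE dbo.LogInvSearches_Daily_2009 (
    ID       bigint        NOT NULL,
    Inv_ID   int           NOT NULL,
    Site_ID  int           NOT NULL,
    LogCount int           NOT NULL,
    LogDay   smalldatetime NOT NULL
        CONSTRAINT CK_LogInvSearches_Daily_2009_LogDay
        CHECK (LogDay >= '20090101' AND LogDay < '20100101'),
    CONSTRAINT PK_LogInvSearches_Daily_2009 PRIMARY KEY CLUSTERED (LogDay, ID)
);

CREATE TABLE dbo.LogInvSearches_Daily_2010 (
    ID       bigint        NOT NULL,
    Inv_ID   int           NOT NULL,
    Site_ID  int           NOT NULL,
    LogCount int           NOT NULL,
    LogDay   smalldatetime NOT NULL
        CONSTRAINT CK_LogInvSearches_Daily_2010_LogDay
        CHECK (LogDay >= '20100101' AND LogDay < '20110101'),
    CONSTRAINT PK_LogInvSearches_Daily_2010 PRIMARY KEY CLUSTERED (LogDay, ID)
);
GO

-- The view simply unions the member tables; queries go against the view.
CREATE VIEW dbo.LogInvSearches_Daily_All
AS
SELECT ID, Inv_ID, Site_ID, LogCount, LogDay FROM dbo.LogInvSearches_Daily_2009
UNION ALL
SELECT ID, Inv_ID, Site_ID, LogCount, LogDay FROM dbo.LogInvSearches_Daily_2010;
GO
```

Whether this helps in your case depends on how the data is loaded and queried, so treat it as a starting point to test rather than a recommendation.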

On the broader question, I wonder whether you really need DENSE_RANK. I suspect you may be confusing the ORDER BY inside DENSE_RANK with the ORDER BY of the query itself. As it stands, TOP 5 will return 5 undefined records, because SQL Server does not guarantee any record ordering unless an ORDER BY clause is specified (which you haven't done). If you move the ORDER BY out of DENSE_RANK and make it the ORDER BY of the whole query, as below, the records will be returned as I think you want them, and this removes the need for the expensive DENSE_RANK ranking function.

SELECT TOP 5
    SUM([LogCount]) AS [Views],
    [Inv_ID]
FROM [LogInvSearches_Daily] D (NOLOCK)
WHERE 
    [LogDay] > DateAdd(d, -30, getdate())
    AND EXISTS(
        SELECT *
        FROM Inventory (NOLOCK)
        WHERE Acct_ID = 18731
            AND Inv_ID = D.Inv_ID
    )
GROUP BY
    Inv_ID
ORDER BY
    [Views] DESC,
    [Inv_ID]

      

UPDATE:



The time is probably being spent here:

|--Sort(ORDER BY:([D].[Inv_ID] ASC))

      

You can try creating a covering index like this:

CREATE NONCLUSTERED INDEX [IX_LogInvSearches_Daily_Perf] ON [dbo].[LogInvSearches_Daily] 
(
    [Inv_ID] ASC,
    [LogDay] ASC
)
INCLUDE
(
    [LogCount]
)

      

Note that I also changed the ORDER BY slightly (Inv_ID is now sorted by ASC instead of DESC). I suspect this change will not affect the results in a problematic way, but it might help performance as it will return rows in the same order in which they are grouped (although this may not be appropriate!).



Partitioning aside

Based on our experience with a larger table than yours, we extract the data into a temp table (not a table variable) and aggregate over it. Not for all queries, but for more complex ones.
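
As a rough sketch of that approach applied to the query in the question: the temp table name #RecentViews and the variable @Since are made up for illustration, and declaring @Since as smalldatetime (to match the LogDay column) is my assumption, not something the question specifies.

```sql
-- Cutoff computed once, typed to match the LogDay column (an assumption).
DECLARE @Since smalldatetime;
SET @Since = DATEADD(day, -30, GETDATE());

-- Step 1: copy only the rows for this account and date range into a temp table.
SELECT D.Inv_ID, D.LogCount
INTO #RecentViews
FROM dbo.LogInvSearches_Daily AS D
WHERE D.LogDay > @Since
  AND EXISTS (
        SELECT 1
        FROM propertyControlCenter.dbo.Inventory AS I
        WHERE I.Acct_ID = 18731
          AND I.Inv_ID = D.Inv_ID
      );

-- Step 2: aggregate over the much smaller temp table.
SELECT TOP 5
    Inv_ID,
    SUM(LogCount) AS Views
FROM #RecentViews
GROUP BY Inv_ID
ORDER BY Views DESC, Inv_ID;

DROP TABLE #RecentViews;
```

Unlike a table variable, the temp table gets statistics, which can help the optimizer pick a sensible plan for the second step.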



Also, I agree with Daniel Renshaw's comments on DENSE_RANK

I would also consider moving [Inv_ID], [LogCount] into the index key (rather than INCLUDE), possibly with a DESC sort.



The Acct_ID is in the Inventory table and seems to have an index of its own (IX_Inventory_Acct_ID). Perhaps if Inventory had an index on (Acct_ID, Inv_ID) and LogInvSearches_Daily was clustered (or at least indexed) on (Inv_ID, LogDay), you would have better luck.
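
For the Inventory side, something along these lines is what I have in mind. The index name is hypothetical, and I am assuming Inventory lives in the dbo schema of propertyControlCenter, as the query in the question suggests.

```sql
USE propertyControlCenter;
GO

-- Hypothetical index name. Keys on the account first, then the inventory ID,
-- so the EXISTS / join lookup can be answered from the index alone.
CREATE NONCLUSTERED INDEX IX_Inventory_Acct_ID_Inv_ID
    ON dbo.Inventory (Acct_ID, Inv_ID);
GO
```

It is worth checking the definition of the existing IX_Inventory_Acct_ID first, since it may already cover both columns.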

BTW, I don't see what your current clustered index on LogInvSearches_Daily.ID is supposed to buy you. Why would it matter that records with similar IDs are stored close together on disk?
