Skip to content

Query Design in data warehouse

August 25, 2011

1. Keep your queries simple

  •  Try to use a single star join with grouping and aggregation when possible because SQL Server Enterprise Edition has specific optimizations for these types of queries
  • Avoid using expressions, functions, or local variables in the WHERE clause, as these will cause SQL Server to “guess” the selectivity of a condition. This way the query optimizer will be much more likely to choose a good query plan
  •  Avoid including parameters or variables (such as @ParameterName or @VariableName) in your data warehouse queries unless you know you need to save on compile time. Use literals (that is, actual numbers, dates, or strings) in your query instead of parameters. This will help you get a better query plan but for data warehouse queries, the cost of compiling the plan is usually dwarfed by the cost of execution, so plan reuse is not important. If you want to parameterize your queries to make it easier to program your application or to improve security as a way to prevent SQL injection, use the OPTION(RECOMPILE) hint on your query to make sure you get the best plan for the specific parameter values you are passing in to your query.

2. Consider using temp tables instead of CTEs

Although common table expressions (CTEs) are convenient for programming, they do not force SQL Server to generate a temporary partial query result. Heavy use of CTEs in your queries may lead to redundant computation of the CTE expression results by the query plan in some cases. For better performance, consider using temp tables instead of CTEs, in order to break down a large business question into manageable chunks. This may of course mean you have to split a single query with several CTEs into multiple queries that retrieve results into temp tables. Temp tables have statistics while table variables do not, which is why we do not recommend table variables here.

3. Should put an explicit date filter on the fact table (instead dimension tables)

Date key is encoded as integers in the form YYYYMMDD

To find the top 10 products in total sales between January 1 and 7, 2008, the following would be good query to use:

 SELECT TOP 10 p.ProductKey,SUM(f.SalesAmount)

FROM FactInternetSales f,DimProduct p

WHERE f.ProductKey=p.ProductKey AND p.ProductAlternateKey likeN’BK-%’

AND f.OrderDateKey between 20080101 AND20080107

GROUPBY p.ProductKey

ORDERBYSUM(f.SalesAmount)desc

Avoid a query like the following when possible because it filters on date via join:

SELECT TOP 10 p.ProductKey,SUM(f.SalesAmount)

FROM FactInternetSales f, DimProduct p,DimTime t

WHERE f.ProductKey=p.ProductKey AND p.ProductAlternateKey likeN’BK-%’

AND f.OrderDateKey = t.TimeKey

AND t.MonthNumberOfYear =1

AND t.CalendarYear =2008

AND t.DayNumberOfMonth between 1 AND7

GROUPBY p.ProductKey

ORDERBYSUM(f.SalesAmount)desc

 If you must include values from a period dimension in the output of your query to include it in a report or to control the grouping level for aggregation, you can join it to the fact table but still not use it to filter. You would still filter explicitly on the fact table. For example, the following query joins on DimTime but does not filter using it, and pulls in the year and month for display:

 SELECT TOP 10 p.ProductKey, SUM(f.SalesAmount), t.CalendarYear, t.EnglishMonthName

FROM FactInternetSales f, DimProduct p, DimTime AS t

WHERE f.ProductKey=p.ProductKey AND p.ProductAlternateKey likeN’BK-%’

AND f.OrderDateKey between 20080101 AND20080107

AND f.OrderDateKey=t.TimeKey

GROUPBY p.ProductKey, t.CalendarYear, t.EnglishMonthName

ORDERBYSUM(f.SalesAmount)desc

 4. Avoid complex expressions inside aggregates in your query

For example, instead of SUM(1.1*x) use 1.1*SUM(X). This can reduce the amount of time spent evaluating expressions at the innermost level of query execution

5. Avoid mixing DISTINCT and non-DISTINCT aggregates in the same query

SQL Server sometimes generates multi-consumer spool operators as part of a query plan. The portion of the query plan above a multi-consumer spool typically is not parallelized. For this reason, if you have a performance problem with a query that has a multi-consumer spool, consider rewriting the query to eliminate the multi-consumer spool. Multi-consumer spools typically arise when you mix DISTINCT and non-DISTINCT aggregates in the same query. In the following query, we use AGG1 and AGG2 to represent any aggregates, such as SUM, COUNT, MIN, and MAX. For example, a query or subquery of this form may generate a multi-consumer spool:

 SELECT a, AGG1(DISTINCT b), AGG2(c)

FROM table

GROUP BY a

You can rewrite this to eliminate the use of DISTINCT and thus remove the multi-consumer spool from the plan, as follows:

 SELECT a, AGG1(b), AGG2(c)

FROM

(

      SELECT a, b, AGG2(c)

      FROM table

      GROUP BY a, b

) as T

GROUP BY a

This rewrite is only correct if the aggregates AGG1 and AGG2 can be computed incrementally from their input, regardless of how the input has been ordered, or partially grouped and aggregated. SUM, COUNT, MIN, and MAX all have these required characteristics. However, user-defined CLR aggregates might not have the necessary properties.

6. Use GROUPING SETS

The new GROUPING SETS feature in SQL Server 2008 can enable you to avoid redundant computation and shorten your queries. For example, you want to create a report that shows sales totals for a year, quarter, and country, in addition to country totals, period totals, and a grand total. A typical way to display this information would be a report with the format shown in the following figure.

Grouping Sets Pivot Table.gif

Figure 1: Pivot table report layout suitable for displaying data retrieved by GROUPING SETS query

The GROUP BY CUBE notation in SQL is not sufficient to express this query because the period of interest is represented by two columns, not one. The new GROUPING SETS feature enables you to specify all sets of grouping columns you are interested in, in a single relatively simple query as follows:

 USE AdventureWorksDW;

SELECT      D.CalendarYear, D.CalendarQuarter, T.SalesTerritoryCountry,

      SUM(F.SalesAmount) AS SalesAmount

FROM  dbo.FactResellerSales F

            INNER JOIN dbo.DimTime D ON F.OrderDateKey = D.TimeKey

            INNER JOIN dbo.DimSalesTerritory T ON

                  F.SalesTerritoryKey = T.SalesTerritoryKey

WHERE D.CalendarYear IN (‘2003’, ‘2004’)

GROUP BY GROUPING SETS (

      (D.CalendarYear, D.CalendarQuarter, T.SalesTerritoryCountry),

      (D.CalendarYear, D.CalendarQuarter),

      (T.SalesTerritoryCountry),

      ()

);

 Without using GROUPING SETS, the simplest way to write this query is as the union of several SELECTs, one for each grouping set, similar to the following:

 SELECT      D.CalendarYear, D.CalendarQuarter, T.SalesTerritoryCountry,

      SUM(F.SalesAmount) AS SalesAmount

FROM  dbo.FactResellerSales F

            INNER JOIN dbo.DimTime D ON F.OrderDateKey = D.TimeKey

            INNER JOIN dbo.DimSalesTerritory T ON

                  F.SalesTerritoryKey = T.SalesTerritoryKey

WHERE D.CalendarYear IN (‘2003’, ‘2004’)

GROUP BY D.CalendarYear, D.CalendarQuarter, T.SalesTerritoryCountry

UNION ALL

SELECT      D.CalendarYear, D.CalendarQuarter, NULL,

      SUM(F.SalesAmount) AS SalesAmount

FROM  dbo.FactResellerSales F

            INNER JOIN dbo.DimTime D ON F.OrderDateKey = D.TimeKey

            INNER JOIN dbo.DimSalesTerritory T ON

                  F.SalesTerritoryKey = T.SalesTerritoryKey

WHERE D.CalendarYear IN (‘2003’, ‘2004’)

GROUP BY D.CalendarYear, D.CalendarQuarter

UNION ALL

SELECT      NULL, NULL, T.SalesTerritoryCountry,

      SUM(F.SalesAmount) AS SalesAmount

FROM  dbo.FactResellerSales F

            INNER JOIN dbo.DimTime D ON F.OrderDateKey = D.TimeKey

            INNER JOIN dbo.DimSalesTerritory T ON

                  F.SalesTerritoryKey = T.SalesTerritoryKey

WHERE D.CalendarYear IN (‘2003’, ‘2004’)

GROUP BY T.SalesTerritoryCountry

UNION ALL

SELECT      NULL, NULL, NULL,

      SUM(F.SalesAmount) AS SalesAmount

FROM  dbo.FactResellerSales F

            INNER JOIN dbo.DimTime D ON F.OrderDateKey = D.TimeKey

            INNER JOIN dbo.DimSalesTerritory T ON

                  F.SalesTerritoryKey = T.SalesTerritoryKey

WHERE D.CalendarYear IN (‘2003’, ‘2004’)

 This second form of the query is of course much longer, and the query plan generated for it is less efficient, with a run time about 2.5 times longer than the GROUPING SETS form.

Reference to Book online of Microsoft

Advertisements
Leave a Comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: