Query Design in data warehouse
1. Keep your queries simple
- Try to use a single star join with grouping and aggregation when possible because SQL Server Enterprise Edition has specific optimizations for these types of queries
- Avoid using expressions, functions, or local variables in the WHERE clause, as these will cause SQL Server to “guess” the selectivity of a condition. This way the query optimizer will be much more likely to choose a good query plan
- Avoid including parameters or variables (such as @ParameterName or @VariableName) in your data warehouse queries unless you know you need to save on compile time. Use literals (that is, actual numbers, dates, or strings) in your query instead of parameters. This will help you get a better query plan but for data warehouse queries, the cost of compiling the plan is usually dwarfed by the cost of execution, so plan reuse is not important. If you want to parameterize your queries to make it easier to program your application or to improve security as a way to prevent SQL injection, use the OPTION(RECOMPILE) hint on your query to make sure you get the best plan for the specific parameter values you are passing in to your query.
2. Consider using temp tables instead of CTEs
Although common table expressions (CTEs) are convenient for programming, they do not force SQL Server to generate a temporary partial query result. Heavy use of CTEs in your queries may lead to redundant computation of the CTE expression results by the query plan in some cases. For better performance, consider using temp tables instead of CTEs, in order to break down a large business question into manageable chunks. This may of course mean you have to split a single query with several CTEs into multiple queries that retrieve results into temp tables. Temp tables have statistics while table variables do not, which is why we do not recommend table variables here.
3. Should put an explicit date filter on the fact table (instead dimension tables)
Date key is encoded as integers in the form YYYYMMDD
To find the top 10 products in total sales between January 1 and 7, 2008, the following would be good query to use:
SELECT TOP 10 p.ProductKey,SUM(f.SalesAmount)
FROM FactInternetSales f,DimProduct p
WHERE f.ProductKey=p.ProductKey AND p.ProductAlternateKey likeN’BK-%’
AND f.OrderDateKey between 20080101 AND20080107
GROUPBY p.ProductKey
ORDERBYSUM(f.SalesAmount)desc
Avoid a query like the following when possible because it filters on date via join:
SELECT TOP 10 p.ProductKey,SUM(f.SalesAmount)
FROM FactInternetSales f, DimProduct p,DimTime t
WHERE f.ProductKey=p.ProductKey AND p.ProductAlternateKey likeN’BK-%’
AND f.OrderDateKey = t.TimeKey
AND t.MonthNumberOfYear =1
AND t.CalendarYear =2008
AND t.DayNumberOfMonth between 1 AND7
GROUPBY p.ProductKey
ORDERBYSUM(f.SalesAmount)desc
If you must include values from a period dimension in the output of your query to include it in a report or to control the grouping level for aggregation, you can join it to the fact table but still not use it to filter. You would still filter explicitly on the fact table. For example, the following query joins on DimTime but does not filter using it, and pulls in the year and month for display:
SELECT TOP 10 p.ProductKey, SUM(f.SalesAmount), t.CalendarYear, t.EnglishMonthName
FROM FactInternetSales f, DimProduct p, DimTime AS t
WHERE f.ProductKey=p.ProductKey AND p.ProductAlternateKey likeN’BK-%’
AND f.OrderDateKey between 20080101 AND20080107
AND f.OrderDateKey=t.TimeKey
GROUPBY p.ProductKey, t.CalendarYear, t.EnglishMonthName
ORDERBYSUM(f.SalesAmount)desc
4. Avoid complex expressions inside aggregates in your query
For example, instead of SUM(1.1*x) use 1.1*SUM(X). This can reduce the amount of time spent evaluating expressions at the innermost level of query execution
5. Avoid mixing DISTINCT and non-DISTINCT aggregates in the same query
SQL Server sometimes generates multi-consumer spool operators as part of a query plan. The portion of the query plan above a multi-consumer spool typically is not parallelized. For this reason, if you have a performance problem with a query that has a multi-consumer spool, consider rewriting the query to eliminate the multi-consumer spool. Multi-consumer spools typically arise when you mix DISTINCT and non-DISTINCT aggregates in the same query. In the following query, we use AGG1 and AGG2 to represent any aggregates, such as SUM, COUNT, MIN, and MAX. For example, a query or subquery of this form may generate a multi-consumer spool:
SELECT a, AGG1(DISTINCT b), AGG2(c)
FROM table
GROUP BY a
You can rewrite this to eliminate the use of DISTINCT and thus remove the multi-consumer spool from the plan, as follows:
SELECT a, AGG1(b), AGG2(c)
FROM
(
SELECT a, b, AGG2(c)
FROM table
GROUP BY a, b
) as T
GROUP BY a
This rewrite is only correct if the aggregates AGG1 and AGG2 can be computed incrementally from their input, regardless of how the input has been ordered, or partially grouped and aggregated. SUM, COUNT, MIN, and MAX all have these required characteristics. However, user-defined CLR aggregates might not have the necessary properties.
6. Use GROUPING SETS
The new GROUPING SETS feature in SQL Server 2008 can enable you to avoid redundant computation and shorten your queries. For example, you want to create a report that shows sales totals for a year, quarter, and country, in addition to country totals, period totals, and a grand total. A typical way to display this information would be a report with the format shown in the following figure.
Figure 1: Pivot table report layout suitable for displaying data retrieved by GROUPING SETS query
The GROUP BY CUBE notation in SQL is not sufficient to express this query because the period of interest is represented by two columns, not one. The new GROUPING SETS feature enables you to specify all sets of grouping columns you are interested in, in a single relatively simple query as follows:
USE AdventureWorksDW;
SELECT D.CalendarYear, D.CalendarQuarter, T.SalesTerritoryCountry,
SUM(F.SalesAmount) AS SalesAmount
FROM dbo.FactResellerSales F
INNER JOIN dbo.DimTime D ON F.OrderDateKey = D.TimeKey
INNER JOIN dbo.DimSalesTerritory T ON
F.SalesTerritoryKey = T.SalesTerritoryKey
WHERE D.CalendarYear IN (‘2003’, ‘2004’)
GROUP BY GROUPING SETS (
(D.CalendarYear, D.CalendarQuarter, T.SalesTerritoryCountry),
(D.CalendarYear, D.CalendarQuarter),
(T.SalesTerritoryCountry),
()
);
Without using GROUPING SETS, the simplest way to write this query is as the union of several SELECTs, one for each grouping set, similar to the following:
SELECT D.CalendarYear, D.CalendarQuarter, T.SalesTerritoryCountry,
SUM(F.SalesAmount) AS SalesAmount
FROM dbo.FactResellerSales F
INNER JOIN dbo.DimTime D ON F.OrderDateKey = D.TimeKey
INNER JOIN dbo.DimSalesTerritory T ON
F.SalesTerritoryKey = T.SalesTerritoryKey
WHERE D.CalendarYear IN (‘2003’, ‘2004’)
GROUP BY D.CalendarYear, D.CalendarQuarter, T.SalesTerritoryCountry
UNION ALL
SELECT D.CalendarYear, D.CalendarQuarter, NULL,
SUM(F.SalesAmount) AS SalesAmount
FROM dbo.FactResellerSales F
INNER JOIN dbo.DimTime D ON F.OrderDateKey = D.TimeKey
INNER JOIN dbo.DimSalesTerritory T ON
F.SalesTerritoryKey = T.SalesTerritoryKey
WHERE D.CalendarYear IN (‘2003’, ‘2004’)
GROUP BY D.CalendarYear, D.CalendarQuarter
UNION ALL
SELECT NULL, NULL, T.SalesTerritoryCountry,
SUM(F.SalesAmount) AS SalesAmount
FROM dbo.FactResellerSales F
INNER JOIN dbo.DimTime D ON F.OrderDateKey = D.TimeKey
INNER JOIN dbo.DimSalesTerritory T ON
F.SalesTerritoryKey = T.SalesTerritoryKey
WHERE D.CalendarYear IN (‘2003’, ‘2004’)
GROUP BY T.SalesTerritoryCountry
UNION ALL
SELECT NULL, NULL, NULL,
SUM(F.SalesAmount) AS SalesAmount
FROM dbo.FactResellerSales F
INNER JOIN dbo.DimTime D ON F.OrderDateKey = D.TimeKey
INNER JOIN dbo.DimSalesTerritory T ON
F.SalesTerritoryKey = T.SalesTerritoryKey
WHERE D.CalendarYear IN (‘2003’, ‘2004’)
This second form of the query is of course much longer, and the query plan generated for it is less efficient, with a run time about 2.5 times longer than the GROUPING SETS form.
Reference to Book online of Microsoft