
Database Design Tips for a Large-Scale Data Warehouse

August 26, 2011

This section covers best practices for schema design and for the physical design of a database, such as disk layout, indexes, compression, and summary aggregates.

1. Schema Design

Database schema design (logical design) can have a major impact on the performance of queries and updates. For best performance, and for ease of understanding the data and writing queries, we recommend using a star schema.

The data types you choose for columns have a performance impact. Use integer (4‑byte) surrogate key columns to join your fact table to your dimension tables in a star schema. These compress well and support fast comparisons during joins. For decimal measure values, use the money data type if possible because operations on it are faster than for general numeric or decimal types. Avoid the use of long fields of any type in your fact table unless it is essential to the behavior of your application. This helps keep your fact table smaller and usually speeds up query processing and data loading.
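To make this concrete, here is a minimal sketch of a star-schema fragment. The FactSales and DimCustomer names and columns are hypothetical, chosen only to illustrate the data-type advice above; the same names are reused in later sketches.

    -- Dimension table with a 4-byte integer surrogate key.
    CREATE TABLE dbo.DimCustomer (
        CustomerKey  int           NOT NULL PRIMARY KEY,
        CustomerName nvarchar(100) NOT NULL
    );

    -- Fact table: integer surrogate keys only, money for the measure,
    -- and no long character or binary columns.
    CREATE TABLE dbo.FactSales (
        DateKey     int   NOT NULL,  -- YYYYMMDD integer date key
        CustomerKey int   NOT NULL,  -- joins to dbo.DimCustomer
        SalesAmount money NOT NULL
    );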

2. Physical Database Design

Physical design considerations include partitioning, indexing, creation of summary aggregates, compression, layout of tables and partitions on storage devices, and configuration of tempdb.

Table Partitioning

With tables of billions of rows, it is important to be able to break them down into manageable-sized chunks for system management operations so that these operations complete in a reasonable time.

For a data warehouse fact table larger than 50 GB, we recommend partitioning it by time, such as by week. This allows you to bulk-purge a week of data at a time with a metadata-only operation. It can also improve the performance of loading, and of managing summary aggregates maintained with indexed views.
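The sketch below shows what weekly partitioning and a metadata-only purge might look like for the hypothetical FactSales table; the boundary values and the FactSales_Staging table are assumptions for illustration.

    -- Weekly partition function/scheme on the YYYYMMDD integer date key.
    CREATE PARTITION FUNCTION pfWeekly (int)
    AS RANGE RIGHT FOR VALUES (20110801, 20110808, 20110815, 20110822);

    CREATE PARTITION SCHEME psWeekly
    AS PARTITION pfWeekly ALL TO ([PRIMARY]);

    -- Metadata-only purge of the oldest week: switch the partition into an
    -- empty staging table with identical structure on the same filegroup,
    -- then truncate or drop the staging table.
    -- (Assumes dbo.FactSales was created ON psWeekly(DateKey).)
    ALTER TABLE dbo.FactSales SWITCH PARTITION 1 TO dbo.FactSales_Staging;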

Note: SQL Server table partitioning is separate and distinct from the hardware server partitioning supported by machines such as the HP Integrity Superdome.

Index Design

A basic index design that is often effective for a star schema in SQL Server is to:

1. Create an integer date key in YYYYMMDD format as the surrogate key of the date dimension and as a foreign key in the fact table. Create a clustered index on the date key column of the fact table. With the date key in this format, you can express date range filters conveniently and explicitly on the fact table, which in turn can give you better parallel query speed in some cases.

2. Create nonclustered indexes on the fact table on the surrogate keys of the most frequently used dimensions.

3. Create nonclustered indexes on large dimension tables (such as a Customer dimension with millions of rows) on the columns you search on frequently in those tables.

If your table partition width is one day, there is no need for a clustered index on the date key of the fact table. Instead, create the clustered index on the most frequently used dimension key column of the fact table other than the date key. However, if you use partitioning to support a sliding-window scenario, partition the fact table and all its indexes on the date key.
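Expressed as DDL against the hypothetical tables from the earlier sketch, the three steps might look like this:

    -- 1. Clustered index on the YYYYMMDD date key of the fact table.
    CREATE CLUSTERED INDEX CI_FactSales_DateKey
        ON dbo.FactSales (DateKey);

    -- 2. Nonclustered index on a frequently used dimension surrogate key.
    CREATE NONCLUSTERED INDEX IX_FactSales_CustomerKey
        ON dbo.FactSales (CustomerKey);

    -- 3. Nonclustered index on a frequently searched column of a large dimension.
    CREATE NONCLUSTERED INDEX IX_DimCustomer_CustomerName
        ON dbo.DimCustomer (CustomerName);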

Design of Summary Aggregates

Experienced data warehouse developers have long known that one of the best ways to get good data warehouse query performance is to use summary aggregates: sets of rows that summarize data in the database and are typically much smaller than the fact table. If most of your queries can be answered from these aggregates rather than from the raw event data, query performance can improve substantially.

The primary forms of summary aggregates you can use with SQL Server are:

  • SQL Server Analysis Services cubes
  • SQL Server indexed views
  • User-defined summary tables

SQL Server Analysis Services (SSAS) is an online analytical processing (OLAP) system with features tuned for fast, convenient processing of decision support queries.

If Analysis Services suits your reporting needs, Microsoft recommends that you use it in conjunction with your relational data warehouse to improve query performance.

If you wish to keep a relational-database-only configuration with SQL Server without using SSAS, or if you have a significant SQL decision-support query workload that could benefit from aggregates, consider using indexed views or user-defined summary tables to accelerate your common aggregate queries.

Indexed views are maintained automatically by SQL Server when you update your database. If you need to update your database incrementally, such as to correct errors, and have the aggregates updated automatically, indexed views are a good choice. A significant improvement in SQL Server 2008 is partition-aligned indexed views: this feature lets you switch partitions of tables in and out even when those tables have indexed views defined on them.

Indexed views do have restrictions on which views can be indexed. If you want full flexibility and are able to maintain the aggregates yourself in your ETL process, you may wish to use summary tables instead.
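The sketch below shows a minimal indexed view over the hypothetical FactSales table. Two of the notable restrictions are visible here: the view must be created WITH SCHEMABINDING, and a GROUP BY view must include COUNT_BIG(*).

    -- Aggregate view; SCHEMABINDING and COUNT_BIG(*) are required.
    CREATE VIEW dbo.vSalesByDate
    WITH SCHEMABINDING
    AS
    SELECT DateKey,
           SUM(SalesAmount) AS TotalSales,
           COUNT_BIG(*)     AS RowCnt
    FROM dbo.FactSales
    GROUP BY DateKey;
    GO

    -- The unique clustered index is what materializes the view; SQL Server
    -- then maintains it automatically as the fact table changes.
    CREATE UNIQUE CLUSTERED INDEX CI_vSalesByDate
        ON dbo.vSalesByDate (DateKey);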

Compression

Data compression, a new feature available in SQL Server 2008 Enterprise Edition, is extremely valuable in a scale-up environment. It can dramatically reduce I/O requirements by (a) reducing the amount of data that must be read from disk, and (b) increasing the percentage of the data that resides in main memory during normal operation. The second effect can be dramatic, speeding up some queries by an order of magnitude or more.

However, compression increases the CPU time required to load data. With page compression, CPU usage during loading can increase by about a factor of 2.5. Another form of compression, row compression, yields smaller compression factors, but the cost to compress is significantly less.

Query processing time does not differ significantly between row and page compression. The actual benefits and costs of compression depend on your data, hardware, and workload. However, in large-scale internal tests on 600-GB and 6-terabyte data warehouses using page compression, we observed a 30-40% improvement in query response time with a 10-15% CPU time penalty.

Compression can be applied to individual partitions. If you have a large table that is already partitioned, but the partitions are not compressed, and you do not have a large time window available in which you can compress the full table, consider compressing only the new data during ETL. Then over time, as you age out old partitions and add new ones, the entire data set will gradually become compressed, without the need to extensively re-organize the data all at once.
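For example (the partition number is hypothetical), compressing a single partition of an existing table looks like this:

    -- Rebuild only partition 5 of the fact table with PAGE compression;
    -- older partitions remain uncompressed until they are rebuilt or aged out.
    ALTER TABLE dbo.FactSales
    REBUILD PARTITION = 5
    WITH (DATA_COMPRESSION = PAGE);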

Microsoft SQL Server Customer Advisory Team engineers have determined that for most large data warehouses, once about 20 percent of the data is in main memory, little or no I/O needs to be done to process queries. So a best practice is to:

  • Compress your fact tables by using the PAGE compression feature [Agar08].
  • Add enough memory to your system so that at least 10-20% of the compressed data will fit.
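As a rough worked example, assuming (hypothetically) a 3x page compression ratio, a 6-terabyte raw warehouse would occupy about 2 TB compressed, so holding 10-20% of it in memory means provisioning roughly 200-400 GB of RAM.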

Disk Layout

Storage throughput is the most critical resource for a good data warehouse system. It is also one of the most expensive. A rule of thumb for any database system is that multiple small drives are better than one big one. The number of spindles you can apply to your storage system is probably the single most important factor in creating a high-throughput I/O subsystem.

Read-and-write caching in the disk devices or the SAN can be a big benefit, up to a point. However, very large table scans can quickly fill this hardware-level cache, after which it provides little benefit.

While a properly configured direct-attached storage system can equal or exceed the throughput of a SAN, SAN systems have features that are highly beneficial for very large databases. SAN snapshots, copy-on-write, and similar features can make administrative tasks much easier.

To help you determine how good your disk layout is, download SQLIO.exe or IOMETER.exe. Run one of these tools to get a rating for both the number of I/Os per second and the throughput in bytes per second. SQL Server uses a variety of block sizes for both reads and writes, so it is important to run a variety of tests that match your usage patterns.

There are three basic parameters to vary when running the I/O tests: block size, read/write, and serial/random. For SQL Server, the minimum tests to run are 8-KB and 64-KB block sizes, both serial and random, and both read and write. Most of the ratings you find on the Web are stated as bytes/sec throughput for the 64-KB serial read test (which gives the highest throughput rating). A value of over 250 MB/sec per core is typically sufficient for today's computers.
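As an illustration, the minimal test matrix could be driven with SQLIO commands along these lines; the test file path, duration, and thread/outstanding-I/O settings here are assumptions to adapt to your environment.

    :: 64-KB sequential (serial) read -- the test most published ratings quote
    sqlio -kR -fsequential -b64 -t4 -o8 -s120 -LS -BN E:\testfile.dat

    :: 8-KB random read and write
    sqlio -kR -frandom -b8 -t4 -o8 -s120 -LS -BN E:\testfile.dat
    sqlio -kW -frandom -b8 -t4 -o8 -s120 -LS -BN E:\testfile.dat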

There is one configuration setting, common to most disk subsystems and host bus adapters (HBAs), that can be beneficial to change. The Queue Depth setting normally defaults to 4, but for SQL Server data warehouse systems it should be set to a minimum of 32. If you have many Fibre Channel connections, use the maximum of 254. For instructions on how to adjust the Queue Depth setting, consult your storage hardware documentation.

Make sure that the filegroups underneath each physical index or heap have adequate I/O bandwidth. A data warehouse fact table (or fact table partition, if the table is partitioned) ideally should have at least 250 MB/sec of I/O capacity per processor core if a significant part of your workload is I/O bound (limited by I/O performance).
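For example, a 16-core server under this guideline would need roughly 16 x 250 MB/sec = 4 GB/sec of aggregate I/O bandwidth beneath the fact table's filegroups.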

tempdb Configuration: refer to the tempdb configuration article.

Reference (book preview available online):

http://books.google.com.vn/books?id=jMAFcuyVSt0C&pg=PA34&lpg=PA34&dq=sample+star+schema+database+%2B+view+report+by+country+and+company&source=bl&ots=8_hgx-GkCZ&sig=DO2ZNfA3FN6qbfvKttQ0PS4ji5U&hl=vi&ei=RUiBTq_OBbGhiAemkvz3Dg&sa=X&oi=book_result&ct=result&resnum=2&ved=0CCAQ6AEwATgK#v=onepage&q=country&f=false

Star Schema: The Complete Reference, by Christopher Adamson

+ Operational systems

+ Analytic systems

+ Avoid mixed grain in fact tables

Chapter 6: More on Dimension Tables

Chapter 7: Hierarchies and Snowflakes

Chapter 8: More Slow Change Techniques

Chapter 16: Design and Business Intelligence

Chapter 17: Design and ETL
