
ETL Tips for a Large-Scale Data Warehouse

September 9, 2011

An important part of ETL is index maintenance. Both index creation and index maintenance can be time consuming. Depending on how much data your ETL changes or adds, you can either let the existing indexes be maintained in place as your DML statements run, or drop the indexes, perform the load, and then rebuild the indexes. If you are changing a very large number of rows, the drop-and-rebuild approach is typically better; otherwise, maintaining the indexes in place is preferred. Consider experimenting with both approaches to find the best one for your situation.
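As a rough illustration of the drop-and-rebuild approach, the following sketch disables the nonclustered indexes on a hypothetical fact table before the load and rebuilds everything afterwards; dbo.FactSales and the index names are placeholders rather than objects from the sample databases:

-- Disable the nonclustered indexes before the load (the clustered index keeps the table accessible).
ALTER INDEX IX_FactSales_CustomerKey ON dbo.FactSales DISABLE;
ALTER INDEX IX_FactSales_OrderDateKey ON dbo.FactSales DISABLE;

-- ... perform the bulk load here ...

-- Rebuild all indexes once the load completes.
ALTER INDEX ALL ON dbo.FactSales REBUILD;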

SQL Server 2008 and SSIS 2008 provide some definite advantages over the previous release. In this section, the new features that speed up ETL processing are briefly highlighted.

Pipeline Performance

At run time, the SSIS engine converts the package abstraction (a graph of linked data sources, transformations, and destinations) into a series of pipelines that move the data down the graph, keeping the data in memory buffers (to avoid staging) and minimizing the number of data-copy operations. Because of these optimizations, the SSIS engine can efficiently process data even within an arbitrarily complex graph.

In SQL Server 2005, the pipeline execution engine allocates worker threads in an affinitized manner, meaning that once a thread has finished its work it cannot take on any other pending work.

In SQL Server 2008, resource assignment uses a thread-pooling mechanism whereby a thread can be dynamically assigned to a different task as soon as it has completed its current work. With this improvement in pipeline parallelism, the SSIS pipeline engine makes better use of the available hardware, leading to predictable performance, increased parallelism, faster load times, and a higher return on investment. For the same package running on the same hardware, SQL Server 2008 can be up to four times faster than SQL Server 2005.

Lookup Performance

The Lookup component in SSIS tests whether each row in a stream of rows has a matching row in another dataset. A lookup is akin to a database join operation (specifically a relational hash join); the key difference is that the comparison takes place outside the context of the relational engine. Data from one source is hashed into an in-memory cache within a Lookup component and data from the other source is streamed past the Lookup so that specified keys can be compared. This is useful because the two datasets may come from completely different (and possibly non-relational) data sources.

Lookup in SQL Server 2008 Integration Services has several major advantages over the previous version:

The in-memory cache can be loaded from any source, unlike Integration Services in SQL Server 2005, which permitted only OLE DB sources. For instance, the cache can now be loaded from a flat file, ADO.NET, OLE DB, ODBC, a Visual Basic .NET or C# script, or even a separate pipeline. This reduces the amount of staging a solution needs and increases its flexibility.

The cache content can be located in virtual memory or persisted to permanent file storage. This means that within the same package, multiple lookup components can share the same cache. If the cache is saved to disk, it can be shared across different packages. The cache file format is optimized for speed and access to it can be orders of magnitude faster than reloading the reference dataset from the original relational source.

The Lookup cache has three options:

  • Partial cache (with miss cache): Key values are cached as they are looked up, and the miss cache additionally remembers keys that have no matching entry in the reference dataset, which avoids redundant and expensive trips to the database. The miss-cache feature alone can contribute up to a 40 percent performance improvement. Partial caching is appropriate when the reference table is large and/or only a small portion of it is actually referenced (see the per-key query sketched after this list).
  • Full cache: The entire reference dataset is loaded into the cache in advance, so no key lookups hit the database at execution time. Full caching requires that the whole reference dataset fit into memory and performs best when the reference table is small and stable.
  • No caching: Every lookup goes to the database. This is recommended only when repeated lookups of the same key are rare.
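To make the cache modes concrete, the following rough sketch shows what the component effectively runs against the reference table in each mode; the exact query SSIS generates may differ, and DimCustomer is used here only as an illustrative reference table:

-- Full cache: the entire reference dataset is read once, up front.
SELECT CustomerAlternateKey, CustomerKey
FROM dbo.DimCustomer;

-- Partial cache / no cache: a parameterized query runs per incoming key;
-- the partial cache remembers hits (and, with the miss cache, misses as well).
SELECT CustomerAlternateKey, CustomerKey
FROM dbo.DimCustomer
WHERE CustomerAlternateKey = ?;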

Other enhancements to the Lookup component include optimized I/O routines, leading to faster cache loading and lookup operations. The Lookup also has improved error handling and reporting, and the ability to change the query at run time. Users are encouraged to take advantage of the ability to reuse the lookup cache across packages.

ETL Features in the SQL Server Engine

Though the following features are provided not by SSIS directly but by the SQL Server 2008 Database Engine, they help deliver high-performance ETL solutions.

Change Data Capture

By using the change data capture feature, ETL solutions can be designed to be much more robust and to deliver increased performance and scalability. Using change data capture to lower the cost of capturing changes can shorten ETL processing time significantly, shortening your batch window, increasing the availability of your data warehouse, and reducing ETL hardware costs. Change data capture can also enable “trickle feed” scenarios without requiring changes to your operational applications.
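As a rough sketch of how change data capture is enabled and consumed (the SourceDB database and dbo.Customer table are placeholders, and a real ETL job would persist the LSN range it last processed):

USE SourceDB;
GO
-- Enable change data capture for the database and for one source table.
EXEC sys.sp_cdc_enable_db;
EXEC sys.sp_cdc_enable_table
     @source_schema = N'dbo',
     @source_name   = N'Customer',
     @role_name     = NULL;
GO
-- Later, the ETL process reads only the rows that changed within an LSN range.
DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn(N'dbo_Customer');
DECLARE @to_lsn   BINARY(10) = sys.fn_cdc_get_max_lsn();
SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_Customer(@from_lsn, @to_lsn, N'all');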

Merge

The MERGE statement enables you to specify how each operation type should occur, including advanced semantics such as SCD-2 (Type 2 slowly changing dimension) behavior.

The common problem that this statement solves: when the source system provides a batch of rows, some of which are new (inserts) and some of which already exist in the destination but with different values (updates), there is no ready way to distinguish the two types. One solution is to use an SSIS Lookup operation to determine which rows are new and which already exist; the easiest way to do this is the Slowly Changing Dimension wizard in SSIS 2008, which automatically configures the appropriate components. However, this approach may not scale well, because the Lookup component requires that one side of the join be cached in memory. That is fine for simple cases, but if the data to be cached is large and/or wide, the overhead and time of loading it into memory, as well as the hardware requirements, can make the approach untenable. The other solution is to join the source and destination in a relational UPDATE (to update the existing rows), followed by an INSERT that uses a left outer join to determine which rows to insert (the new rows). The drawback of this solution is that the two operations must run serially and you incur the cost of two joins.

In SQL Server 2008, the merge operator solves this type of issue. The merge operator enables you to specify how each row from the source should be treated in relation to the destination. Since it uses only one (internal) join, it offers increased performance, and coupled with granular syntax options it delivers a powerful solution.

For example, the following Transact-SQL script based on the AdventureWorks and AdventureWorksDW sample databases shows how a Customer dimension table can be updated or have rows inserted into it by using a single MERGE statement.

USE AdventureWorksDW;
GO

MERGE dbo.DimCustomer AS [Dest]
USING (
       SELECT
              ContactID,
              N'AW' + RIGHT(N'0000000' + CONVERT(NVARCHAR(10), ContactID), 8) AS [Key],
              FirstName,
              LastName,
              Phone
       FROM AdventureWorks.Person.Contact
) AS [Source]
ON [Dest].CustomerAlternateKey = [Source].[Key]
WHEN MATCHED THEN UPDATE SET
       [Dest].FirstName = [Source].FirstName,
       [Dest].LastName = [Source].LastName,
       [Dest].Phone = [Source].Phone
WHEN NOT MATCHED BY TARGET THEN INSERT (
       GeographyKey, CustomerAlternateKey, FirstName,
       LastName, Phone, DateFirstPurchase
) VALUES (
       1, [Source].[Key], [Source].FirstName,
       [Source].LastName, [Source].Phone, GETDATE()
);

Minimally Logged Insert

Minimally logged insert can have a large impact on the load performance of the ETL process. Rather than writing the data to the log and then to disk, as write-ahead logging requires for full recovery, the data can be written to disk only once when full recoverability is not required, for example, when inserting large amounts of data into a table during ETL.

Minimal logging was introduced in SQL Server 2005 and refers to logging only the information required to roll back the transaction, without supporting point-in-time recovery. Minimal logging is available only under the bulk-logged and simple recovery models. Operations that can be minimally logged in SQL Server 2005 include bulk import operations, SELECT INTO, and index creation and rebuild. SQL Server 2008 extends the optimization to INSERT INTO…SELECT FROM Transact-SQL statements that insert a large number of rows into an existing table under either of the following conditions (a sketch follows the list):

  • Inserting into an empty table that has a clustered index and no non-clustered indexes
  • Inserting into a heap that has no indexes but that can be non-empty 
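A minimal sketch of the pattern, assuming the database is in the bulk-logged or simple recovery model and dbo.FactSalesStage is a placeholder heap with no indexes; note that the INSERT INTO…SELECT optimization also expects a TABLOCK hint on the target:

ALTER DATABASE AdventureWorksDW SET RECOVERY BULK_LOGGED;
GO
-- Minimally logged INSERT ... SELECT into a heap with no indexes (TABLOCK taken on the target).
INSERT INTO dbo.FactSalesStage WITH (TABLOCK)
SELECT ProductKey, OrderDateKey, CustomerKey, SalesAmount
FROM dbo.FactInternetSales;
GO
ALTER DATABASE AdventureWorksDW SET RECOVERY FULL;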
Did Windows Server 2008 figure into this?

A lot of innovative engineering work in Windows Server 2008, including significant improvements in memory management, PCI and block storage I/O, and core networking, helped achieve this performance. Because of these advances, Windows Server 2008 sustained about 960 megabytes per second over the Ethernet network while processing one large table.

When you execute SQL statements within Integration Services, whether to read a source, to perform a lookup transformation, or to change tables, some standard optimizations can significantly help performance:

  • Use the NOLOCK or TABLOCK hints to remove locking overhead.
  • To optimize memory usage, SELECT only the columns you actually need. If you SELECT all columns from a table (e.g., SELECT * FROM), you needlessly use memory and bandwidth to store and retrieve columns that never get used.
  • If possible, perform your datetime conversions at the source or target database; they are more expensive to perform within Integration Services.
  • SQL Server 2008 Integration Services introduces the shared lookup cache. When you use parallel pipelines, it provides a high-speed, shared cache.
  • If Integration Services and SQL Server run on the same server, use the SQL Server destination instead of the OLE DB destination to improve performance.
  • A commit size of 0 is fastest on heap bulk targets, because only one transaction is committed. If you cannot use 0, use the highest possible commit size to reduce the overhead of writing many batches. A commit size of 0 is a bad idea when inserting into a B-tree, because all incoming rows must be sorted at once into the target B-tree, and if memory is limited you are likely to spill. A commit size of 0 is ideal for inserting into a heap; for an indexed destination, test batch sizes between 100,000 and 1,000,000.
  • Use a commit size below 5,000 to avoid lock escalation when inserting; note that in SQL Server 2008 you can enable or disable lock escalation at the object level, but use this wisely.
  • Heap inserts are typically faster than inserts into a clustered index. This means you may want to drop indexes and rebuild them if you are changing a large part of the destination table; test your inserts both with the indexes in place and with all indexes dropped and rebuilt afterwards.
  • Use partitioning and the partition SWITCH command; that is, load a work table that contains a single partition and SWITCH it into the main table after you build the indexes and put the constraints on (a sketch follows this list).
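The following sketch ties several of these tips together: the source query selects only the needed columns with NOLOCK, the single-partition work table is loaded with TABLOCK, and the populated partition is then switched into the main fact table. The table names, partition number, and boundary values are placeholders, and the work table must match the target's schema, indexes, and filegroup:

-- Load the single-partition work table (explicit column list, NOLOCK on the source).
INSERT INTO dbo.FactSales_Stage WITH (TABLOCK)
SELECT ProductKey, OrderDateKey, CustomerKey, SalesAmount
FROM dbo.FactSales_Incoming WITH (NOLOCK);

-- Build the index and the check constraint that matches the target partition's boundary.
CREATE CLUSTERED INDEX CIX_FactSales_Stage ON dbo.FactSales_Stage (OrderDateKey);
ALTER TABLE dbo.FactSales_Stage
    ADD CONSTRAINT CK_FactSales_Stage_OrderDateKey
    CHECK (OrderDateKey >= 20110101 AND OrderDateKey < 20110201);

-- Switch the fully prepared data into the main table (a metadata-only operation).
ALTER TABLE dbo.FactSales_Stage SWITCH TO dbo.FactSales PARTITION 5;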

Plan for capacity by understanding resource utilization.

SQL Server Integration Services is designed to process large amounts of data row by row in memory with high speed. Because of this, it is important to understand resource utilization, i.e., the CPU, memory, I/O, and network utilization of your packages.

CPU Bound
Seek to understand how much CPU is being used by Integration Services and how much CPU is being used overall by SQL Server while Integration Services is running. The latter point is especially important if SQL Server and SSIS run on the same box, because when the two contend for resources, SQL Server typically wins, which results in disk spilling from Integration Services and slows transformation speed.

The perfmon counter that is of primary interest to you is Process / % Processor Time (Total). Measure this counter for both sqlservr.exe and dtexec.exe. If SSIS is not able to drive close to 100% CPU load, this may be indicative of:

  • Application contention: For example, SQL Server is taking on more processor resources, making them unavailable to SSIS.
  • Hardware contention: A common scenario is that you have suboptimal disk I/O or not enough memory to handle the amount of data being processed.
  • Design limitation: The design of your SSIS package is not making use of parallelism, and/or the package uses too many single-threaded tasks.

Network Bound
SSIS moves data as fast as your network can handle it. Because of this, it is important to understand your network topology and ensure that the path between your source and target has both low latency and high throughput.

The following Network perfmon counters can help you tune your topology:

  • Network Interface / Current Bandwidth: This counter provides an estimate of current bandwidth.
  • Network Interface / Bytes Total / sec: The rate at which bytes are sent and received over each network adapter.
  • Network Interface / Transfers/sec: Tells how many network transfers per second are occurring. If this value approaches 40,000, add another NIC and use teaming between the NICs.

These counters enable you to analyze how close you are to the maximum bandwidth of the system. Understanding this will allow you to plan capacity appropriately whether by using gigabit network adapters, increasing the number of NIC cards per server, or creating separate network addresses specifically for ETL traffic.

I/O Bound
If you ensure that Integration Services does minimal writing to disk, SSIS will hit the disk only when it reads from the source and writes to the target. However, if your I/O is slow, reading and especially writing can create a bottleneck.

Because tuning I/O is outside the scope of this technical note, please refer to Predeployment I/O Best Practices. Remember that an I/O system is specified not only by its size (“I need 10 TB”) but also by its sustainable speed (“I want 20,000 IOPS”).

Memory Bound
A very important question that you need to answer when using Integration Services is: “How much memory does my package use?”

The key counters for Integration Services and SQL Server are:

  • Process / Private Bytes (DTEXEC.exe) – The amount of memory currently in use by Integration Services. This memory cannot be shared with other processes.
  • Process / Working Set (DTEXEC.exe) – The total amount of memory allocated to Integration Services.
  • SQL Server: Memory Manager / Total Server Memory – The total amount of memory allocated by SQL Server. Because SQL Server can also allocate memory through the AWE API, this counter is the best indicator of total memory used by SQL Server. To understand SQL Server memory allocations better, refer to Slava Ok’s Weblog.
  • Memory / Page Reads/sec – Represents the total memory pressure on the system. If this value consistently goes above 500, the system is under memory pressure.

 

Reference: Microsoft SQL Server Books Online
