Skip to content

SSIS: lookup component

November 8, 2011

The lookups performed by the Lookup transformation are case sensitive

Full Cache

lk_fullcache

The default cache mode for the lookup is Full cache. In this mode, the database is queried once during the pre-execute phase of the data flow. The entire reference set is pulled into memory. This approach uses the most memory, and adds additional startup time for your data flow, as all of the caching takes place before any rows are read from the data flow source(s). The trade off is that the lookup operations will be very fast during execution. One thing to note is that the lookup will not swap memory out to disk, so your data flow will fail if you run out of memory.

When to use this cache mode

  • When you’re accessing a large portion of your reference set
  • When you have a small reference table
  • When your database is remote or under heavy load, and you want to reduce the number of queries sent to the server

Keys to using this cache mode

  • Ensure that you have enough memory to fit your cache
  • Ensure that you don’t need to pick up any changes made to the reference table
    • Since the lookup query is executed before the data flow begins, any changes made to the reference table during the data flow execution will not be reflected in the cache

Partial Cache

lk_partialcache

In this mode, the lookup cache starts off empty at the beginning of the data flow. When a new row comes in, the lookup transform checks its cache for the matching values. If no match is found, it queries the database. If the match is found at the database, the values are cached so they can be used the next time a matching row comes in.

Since no caching is done during the pre-execute phase, the startup time using a partial cache mode is less than it would be for a full cache. However, your lookup operations would be slower, as you will most likely be hitting the database more often.

When running in partial cache mode, you can configure the maximum size of the cache. This setting can be found on the Advanced Options page of the lookup UI. There are actually two separate values – one for 32bit execution, and one for 64bit. If the cache gets filled up, the lookup transform will start dropping the least seen rows from the cache to make room for the new ones.

In 2008 there is a new Miss Cache feature that allows you to allocate a certain percentage of your cache to remembering rows that had no match in the database. This is useful in a lot of situations, as it prevents the transform from querying the database multiple times for values that don’t exist. However, there are cases where you don’t want to remember the misses – for example, if your data flow is adding new rows to your reference table. The Miss Cache is disabled by default.

When to use this cache mode

  • When you’re processing a small number of rows and it’s not worth the time to charge the full cache
  • When you have a large reference table
  • When your data flow is adding new rows to your reference table
  • When you want to limit the size of your reference table by modifying query with parameters from the data flow
  • If you selected Partial cache, on the Advanced page, set the following cache options:
    • From the Cache size (32-bit) list, select the cache size for 32-bit environments.
    • From the Cache size (64-bit) list, select the cache size for 64-bit environments.
    • To cache the rows without matching entries in the reference, select Enable cache for rows with no matching entries.
    • From the Allocation from cache list, select the percentage of the cache to use to store the rows without matching entries.

Keys to using this cache mode

  • Ensure that your cache size setting is large enough
  • Use the Miss Cache appropriately
  • If the cache size isn’t large enough for your rows, sort on lookup index columns if possible

No Cache

lk_nocache

As the name implies, in this mode the lookup transform doesn’t maintain a lookup cache (actually, not quite true – we keep the last match around, as the memory has already been allocated). In most situations, this means that you’ll be hitting the database for every row.

When to use this cache mode

  • When you’re processing a small number of rows
  • When you have non-repeating lookup indexes
  • When your reference table is changing (inserts, updates, deletes)
  • When you have severe memory limitations

Keys to using this cache mode

  • Ensure that the partial cache mode isn’t the better choice

Lookup and Cache Transforms in SQL Server Integration Services

Problem
One of the new SQL Server 2008 features in SSIS is an enhanced Lookup transform that includes the ability to cache the lookup data locally.   Can you provide the details of this new functionality and an example of how to use it?

Solution
The enhanced Lookup transform works hand-in-hand with the Cache transform.   The Cache transform is a brand new feature in SQL 2008, allowing you to cache the data used in the Lookup transform.  The Lookup transform can then utilize this cached data to perform the lookup operation.  Working with cached data will most likely be more efficient than querying the database.  In addition the Cache Connection Manager (another new feature) can persist the cache to a local file, allowing the cache to be shared between multiple SSIS packages and steps within a package.  A perfect example of where this will be useful is in the extract, transform and load (ETL) packages that we create to update a data warehouse.  We typically build dimension tables in the data warehouse that use a surrogate key as a primary key (in lieu of the source key in the business application).  As we are processing our fact tables we need to lookup the dimension surrogate keys based on the key in our source system and store the surrogate key in our fact tables.

To demonstrate how to use the new Lookup and Cache transforms, we will create a simple SSIS package as shown below:

The Build Customer Cache Data Flow task will use the new Cache Transform component to populate the cache with the source key and surrogate key for every row in the Customer dimension from the AdventureWorksDW 2008 database.  Note that we could create a separate SSIS package to populate our cache then simply use the cache in multiple SSIS packages.  We could run the SSIS package only when there has been a change to the Customer dimension.  For purposes of demonstration, we will just use a single package.

The Lookup Customer Surrogate Keys Data Flow task will use the enhanced Lookup transform to lookup the Customer dimension surrogate key based on the source key in the Sales.Customer table of the AdventureWorks2008 OLTP database.

Initial Setup

Before we walk through the details of the two Data Flow tasks, let’s first setup the Cache Connection Manager.  The Cache Connection Manager is a new connection manager which is used to define the cache.  To add a Cache Connection Manager right click in the Connection Managers area of the Control Flow design surface and select New Connection.  Select CACHE from the Add SSIS Connection Manager dialog then click the Add button:

After adding the Cache Connection Manager there are some properties that need to be configured.  On the General tab click the Use file cache checkbox and enter a filename for the cache, the filename of this connection manager can be configured by variable such as [@User::CacheConnection] + “Customer.caw”; this will persist the cache to a local file, allowing it to be reused on a subsequent run of this package or another package.

Click the Columns tab to configure the data columns to be stored in the cache.  SourceKey is the name we are using for the lookup column; i.e. it is the column that is joined to the source table to perform the lookup; the Index Position must be 1.  The SurrogateKey column is the value that we want to retrieve via the lookup; i.e. it’s the value that we want to retrieve based on our SourceKey; its Index Position must be zero.  The Columns dialog below assumes a single column lookup; if more than one column is required for the lookup, add the additional columns and increment the Index Position as appropriate.

Build Customer Cache

The Build Customer Cache Data Flow is shown below:

Get Customer Source Keys and Surrogate Keys is an OLE DB Source that selects the CustomerAlternateKey (i.e. the SourceKey) and the CustomerKey (i.e. the surrogate key) for every row in the Customer dimension.  The SELECT statement aliases the columns to SourceKey and SurrogateKey for convenience; they will match the column names in the cache and the Mappings dialog in the Cache transform (shown below) will automatically perform the mapping.

Load Customer Lookup Cache is a Cache Transform component that loads the cache.  The Cache Transform component requires a Cache Connection Manager which defines the cache.  We defined the Cache Connection Manager earlier in the Setup section.  Select the Cache connection manager on the Connection Manager page of the Cache Transformation Editor:

Click Mappings to map the input columns in the Data Flow to the destination columns in the cache.  Based on the aliases in our SELECT statement above, the mapping is performed automatically.

Lookup Customer Surrogate Keys

The Lookup Customer Surrogate Keys Data Flow task is shown below:

Get Orders is an OLE DB Source that retrieves the SourceKey by joining the Sales.SalesOrderHeader table to the Sales.Customer table.  The CAST is required to match the type in the cache and the Customer dimension.   The SourceKey is what we join to the cache to perform our lookup.

Lookup Customer Surrogate Keys is a Lookup transform.  It allows us to retrieve the SurrogateKey for the SourceKey in our Data Flow.  There are a number of properties to configure starting with the General page:

For Cache mode we select Full cache which preloads the cache before performing the lookups.  For Connection type we select Cache connection manager (a new feature) which allows us to preload the cache from a file that exists locally (we created it in the initial Data Flow task).  These two options maximize performance of the Lookup transform in most instances.  You can still choose an OLE DB connection manager which is essentially the functionality you get with SQL Server 2005.  Another new feature in SQL Server 2008 is the ability to redirect rows with no match to a new output (Lookup No Match Output) rather than redirecting these rows to the Error output.  In most cases a lookup with no match will need to be handled separately but it may not be treated as an error.

Click Connection to specify the Cache Connection Manager; we defined this in the Setup section.  It links the Lookup transform to the underlying cache.

Click Columns to configure the lookup operation; the completed dialog is shown below:

Start out by right clicking on the lookup column in the Available Input Columns list; for our example right click the AccountNumber column.  The Create Relationships dialog will be displayed; select AccountNumber as the Input Column and SourceKey as the Lookup Column.

Click SurrogateKey in the Available Lookup Columns list; this is the value that we want to retrieve from the lookup operation.

Save Customer Lookup Matches and Save Customer No Match Lookups are both OLE DB Destinations; we save the successful lookups and any source keys with no match in the cache to separate tables just to see the result of the Lookup transform.  In a typical SSIS package you would assign new surrogate keys for the No Match Lookups then insert all of the rows extracted into the appropriate fact table.  The No Match lookups represent what we call early arriving facts; i.e. a row from the source system that references a dimension source key that doesn’t exist yet.  For additional information on handling these early arriving facts, refer to our earlier tip Handling Early Arriving Facts in SQL Server Integration Services (SSIS) Packages.

Lookup – Using the cache connection manager

Best practices

  • Reuse the cache to reduce database load
  • Share the cache between lookups to reduce memory usage
  • Using the CCM is not always faster than OLEDB – the cost of disk access can out weight the benefits of pre-creating the cache
  • The cache is essentially clear text – do not store sensitive data inside of the cache
  • In terms of Cache Modes and the best practices that surround them, using a cache connection manager is equivalent to using a Full Cache mode

Reducing database and memory usage

If your reference database is remote, or under heavy load, consider using the Cache Connection Manager instead of an OLEDB connection.

Once a cache is used (or created) in an SSIS package, it will be kept in memory until the package has finished executing. The cache can be reused across multiple data flows, and shared between multiple lookups in the same data flow. It can also be persisted to disk, and reused across package executions.

image

Create the cache from any SSIS data source

To use the CCM, you need to create a lookup cache in a separate data flow using the Cache Transformation. Because the cache is created in a regular data flow, this means that you can now use any data source that SSIS can connect to as a source for your lookup reference (flat file, excel, SAP, etc).

With SSIS 2005, a common approach when using non-OLEDB accessible lookup sources was to stage the data first. If this data is only being used for your lookups, consider creating a persisted cache instead.

image

Cache the most common values

Sometimes you might have a large reference table, but the majority of your incoming data only uses a small portion of it. For example, you have a very large custom list, and the top 5% of your customers generate 90% of your transactions. In a scenario like this, you could pre-cache the information of your most active customers. Your data flow could use a cascading lookup pattern where you have one lookup which uses the cache, with its No Match output falling through to a second lookup running in a partial cache mode to hit the database to handle the remaining 10% of rows.

image

Advertisements

From → ETL

Leave a Comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: