Skip to content

Key areas of ETL testing

December 26, 2014
  • Was all the expected data loaded from the source file? Bugs that result in dropped records are common. You must do a full inventory count and ensure that the right records made it into the right tables across the source landing area, staging area, and data warehouse. Were some records rejected because of missing data? Find the problems and work out why they occurred.
  • Were all the records loaded into the right place? If the transaction order gets out of sync you can end up with all the records being processed, but some data gets over written with the older data or in the wrong place. It’s important to verify that each data field from each source loads in the correct field / table and in the correct order.
  • Do you have duplicate data? If you don’t remove duplicate data then the analysis and aggregate reports could be skewed. Duplicates can happen due to the source system feeding duplicates or due to defects in the ETL processes. In either case, the duplicate check must happen as data moves across landing, staging, and finally into the data warehouse.
  • Do the data transformation rules work? You may receive data with different date formats, or using different codes or descriptions for the same thing. The transformation process has to homogenize that data correctly, so that it is stored consistently. Most data warehouses have many such transformation rules which must all be verified with sample data that represents all the various permutations / possibilities.
  • Have you lost any part of data? Maybe truncation has caused a problem because the source file was too big or changed the format. It is important to make sure data from source file did not get truncated as it gets processed from landing to staging to the data warehouse.
  • Is data being corrected and rejected correctly? Sometimes the data may be incomplete or unclear. It may be necessary to hold this data aside and move on with rest of the data. Such data must be logged and rejected for further clarification before it can be re-processed.
  • Is the system generating error logs? You need to know when data is not being processed and why. You also need a system that can gracefully handle something like a system crash or power outage without creating duplicates, losing records, or forcing the user to reprocess from the start (as much as possible).
  • Can the system handle the data load? Performance might not be an issue at first, but you have to think ahead. How much data will the system have to handle 2 years from now? What happens when system already has 5+ yeas of data and new data has to be processed? It must be able to cope with the expected load, not just now, but in the future too.

Leave a Comment

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: