Skip to content

Agile ETL Design Pattern

June 2, 2015
  1. Data warehouse systems have characteristics
    • Functional
    • Resilient: be able to quickly return to previous good condition
    • Efficient: good performance
    • Accuracy
  2. Agile ETL design pattern
    • A data warehouse is a system that extracts, cleans, conforms, and delivers source data into a dimensional data store and then supports and implements querying and analysis for the purpose of decision making
      • The most visible part is “querying and analysis”
      • The most complex and time consuming part is “extracts, cleans, conforms, and delivers”
      • How complex is an ETL? 70-80% of BI (DI -Data Integration- or DW) projects are reliable on ETL process
    • Why Agile developing methodology? Because Data Warehouse is a rigid system while business requirements keep changing. So Agile are here to solve the problems in an interactive way. We can get some production-ready cold out here regularly and we are going to our client’s requirements on an ongoing basis. Therefore, Agile is the key for today data warehouse success but we have to plan for ETL design pattern as well
    • .ETL design pattern must be
      • Flexible: can be changed when the requirements changes with low cost rework, should use templates if any
      • Standards Based: help for integration/ team collaboration
      • Extensible: when needing to make some minor adjustments, should use templates if any
      • Simple:
      • Decoupled: separated
    • ETL Tool Key Features
      • Extract, Load: flexible on interfaces: from flat files, DBMS, XML data, XLS, web services, emails, web logs, FTP, unstructured data
      • Clean: lookups, valiadations, Filters, Translations
      • Transform: changing data structure, joins, (De)Normaization, Aggregation, Rollup, Sorting, Partitioning, Data De-duplication, ability to call external tools
      • Performance
        • Symmetric Multiprocessing (SMP)
          • Pipeline processing
          • Multi-threaded processing
        • Massively Paralled Processing (MPP)
          • Clustering
          • MapRduce
      • Development:
        • Reusable components
        • Impact Analysis / Data Lineage
        • Documentation
      • Manageability
        • Team collaboration
        • Transformation repository
        • Metadata repository
        • Development process (Dev -> test -> Prod)
        • Security
      • Runtime
        • Scheduler Automation
        • Recovery and Restart
        • Workflow
    • Why is ETL (system) important?
      • Remove mistakes and corrects data
      • Adjusts data from multiple sources to be used together (conforming)
      • Structures data to be usable by BI tools
  3. Phases of typical Data Integration Project
    1. Requirements
      • Functional
        • Input
        • Output
        • Output data format
        • Transformation logic
    2. Non-functional
      • Time restrictions
      • Availability
      • Frequency of update
      • Data Latency
      • How to handle erroneous records
      • Security requirements
    3. Planning
      • ETL system implementation to be planned properly
        • Time for implementation
        • Correctly prioritized
        • Thorough data analysis extremely important
        • Unforeseen data quality issues cause delays
        • Biggest risk is unexpected data quality issues
        • Communicate properly
      • Keep it simple
        • Do not try to save the world
        • If you think it can be done simply, do it simply
    4. Analysis (of ETL steps)
      • Extract: data syntax, data semantics, limitations/restrictions of source data, the expected data growth rate
      • Clear & Transform:
        • Which fields need to be validated: against which source? how to handle erroneous data?
        • What is the data flow: transformation logic, action on error: stop or process valid data only
          • the designer should be obliged to define a flow for the problematic data, i.e., the rows that violate integrity or business rule
        • Transformation restart ability
          • small data volumes – transaction based
          • huge data volumes – process in bulk mode
      • Load
        • What is the target schema?
        • What is the target data volume?
        • What are the history requirements? SCD type 1, 2, or 3
        • Data syntax: what is a record, a record delimiter, a field delimiter, data length/ data type/format
        • Data semantics: what are the field names, data of a field
        • Limitations/restrictions: What are the data volumes, any impact on network, frequent export
    5. Implementation
      • Development
        • Enforce standardization
          • naming conventions
          • best practices
          • generating surrogate keys
          • looking up keys
          • applying default values
      • Testing
        • Review
        • Testing on Production data
        • Unit Testing
    6. Documentation:
      • Error handling
      • Support – monitoring checklist
      • Frequency of ETL processes runs
      • Data lineage: important to know and publish
  4. Typical issues
    1. Data Understanding
      • Source system (data)
        • Not documented
        • Documented incorrectly
        • Represent something else than they should
        • Data not clean
      • Transformation / Requirements
        • Not specified properly
        • Not specified at all
        • Initial analysis has not revealed issues/complexity
        • Requirements being changed
    2. Performance & Scalability
      • Performance: current and in 5 years?
      • Scalability
        • What is the data growth rate?
        • Are we testing on production data volumes?
      • Change Data Capture
        • Time consuming task
        • Issue on old systems
  5. Current Market Trends of Di/DW project
    1. Shift to semi-structured & Unstructured data: emails, documents, blogs
    2. Real-time processing: CRM, Zero-latency business, web services, SOA, JMS,…
    3. Cloud-mania: Cloud, cluster, Elastic Cluster, MapReduce, Apache Hadoop
Advertisements
Leave a Comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: