Skip to content

Data Warehouse Design with Agile methodology

July 7, 2015
tools that support database impact analysis and automated testing???
trust dimensional design patterns
data stories
A business process is a collection of linked tasks which find their end in the delivery of a particular service or product to a client. A business process has also been defined as a set of activities and tasks that, once completed, will accomplish a specific organizational goal.
Business events are the measurable atomic details of  business processes
Business events are the individual actions performed by people or organizations during the execution of business processes
When customers buy products or use services, brokers trade stocks, and suppliers deliver components, they leave behind a  trail of business events within  the operational databases of  the organizations involved. These business events contain the atomic-level measurable details of the business processes that DW/BI systems are built to evaluate.
  • Agile data modelling uses business events as incremental units of data discovery/data model
  • Discovering a business event and documenting its data stories in table that is easily understood by business stakeholders and readily translated into logical/physical dimensional models.
    • Data stories and story types: discrete, recurring and evolving

CustomerOrdersTable

  • How to use  the 7Ws (who, what, when, where, how many, why, and how) to get stakeholders thinking dimensionally about their business processes, and describing the information that will become the dimensions and facts of their data warehouse
    •  gets BI stakeholders to think beyond their current reporting requirements by asking them to describe data stories: narratives that tease out the dimensional details of the business activity they need to measure.
    • Discovering business events: asking “Who does what?”
    • Documenting events
    • Describing event details: using the 7Ws and stories themes
    • Modelstorming with whiteboards: practical collaborative data modeling
    • tables help engage stakeholders who would rather define reports that answer their specific business questions than do data modeling
Operational systems support the EXECUTION of business processes
  • – The cardinality of each relationship is the number of possible matching values on either side of the relationship (crow’s feet for many, | for one, 0 for zero or optionality)
  • – Within a relational database, entities are implemented as tables and their attributes as  columns. Relationships  are  implemented  either  as  columns within  existing tables or as additional tables depending on their cardinality. One-to-one (1:1) and many-to-one (M:1) relationships are implemented as columns, whereas many-to-many  (M:M)  relationships  are  implemented  using  additional  tables,  creating additional M:1 relationships.

Data warehouses support the EVALUATION/MEASURE of business processes in terms of measurements (facts) and descriptions (dimensions), which can be used to filter, group, and aggregate the measurements.

  • – DW must be optimized for query processing and ease of use.
  • – Data cubes are often used to visualize simple dimensional models.

Star Schemas are used to VISUALIZE dimensional models and support agile, incremental BI

  • – The fact table contains facts: the numeric (quantitative) measures of a business event
  • – The fact table also contains dimensional FKs, M:M relationship between the dimensions.
  • – The subset of dimensional FKs form a composite PK for the fact table and defines its GRANULARITY/level/detail
  • – The most useful facts are ADDITIVE measures that can be aggregated using any combination of the available dimensions.
  • – The dimension tables contain mainly textual (qualitative) descriptions of the event and provide the context for the measures; dimensional ATTRIBUTE refers to a column in a dimension table
  • – Not all dimensional attributes are text. Dimensions can contain numbers and dates too, but these are generally used like the textual attributes to filter and group the facts rather than to calculate aggregate measures.

Data Warehousing/Business Intelligence (DW/BI) needs new techniques that can revolutionize BI requirements analysis, that are Agile techniques, which addresses both data warehouse design and BI application development.

  • – the early and frequent delivery of working software and responding to change
  • – emphasis on collaboration and incremental development
  • – using 7 W dimensional types – who, what, when, where, many, why, and how – to describe the business events measured
  • – support model-storming by differing radically from conventional entity-relationship ER based model
  • (Business) Event: using grammars (subjects, verbs, and objects) to discover business events and tell data stories
  • (Event) details: Using the 7Ws to discover atomic-level event details. Using prepositions to connect the (event) details to (business) events, and data story themes => story types (discrete, recurring, and involving) and potential fact table granularity
  • Agile Data Warehouse Design
    • Traditional data warehousing follows a near-serial or waterfall approach to design and development
      • with Big Design Up Front BDUF

WaterFallDWDevelopment

  • Dimensional modeling enables incremental development: reduce the risks of pure waterfall by allowing developers to release early incremental BI functionality one star schema at a time, get feedback and make adjustments. But still inherent limitations and initial delays as traditional data warehouse approach
    • with “Big Requirement Up Front” BRUF  preceding BDUF data modelling
  • Agile data warehousing is highly iterative and collaborative: further reduce the risks associated with upfront analysis and provide even more timely BI value by taking a highly iterative, incremental and collaborative approach to all aspects of DW design and development
    • Avoide the BDUF and doing “Just Enough Design Upfront” JEDUF in the initial and Just In Time JIT detailed design with each iteration
  • Agile Data Modelling

AgileDWDevelopment

  • (www.agiledata.org) defines agile data modeling as follows:  “Data modeling is the act of exploring data-oriented structures. Evolutionary data modeling is data modeling performed in an iterative and incremental manner. Agile data modeling is evolutionary data modeling done in a collaborative manner.”
  • Collaborative modeling combines analysis and design and actively involves stakeholders
    • Collaborative data modeling obtains data requirements by modeling directly with stakeholders. It effectively combines analysis and design and ‘cuts to the chase’  of  producing a data model  (working  software  and  documentation) rather than ‘the establishing shot’ of recording data requirements (only documentation).
    • Incremental data modeling gives you more data requirements when they are better understood/needed by stakeholders, and when you are ready to implement them. Incremental modeling and development are scheduling strategies that support early and frequent software delivery.
    • Iterative data modeling helps you to understand existing data requirements better and improve existing database schemas through refactoring: correcting mistakes and adding missing attributes which have now become available or important. Iterative modeling and development are rework strategies that increase software value.
  • Agile DIMENSIONAL Modeling
    • Agile dimensional modeling focuses on business processes rather than reports
      • focus on the far smaller (finite) number of relatively stable business processes that stakeholders want to measure now or next INSTEAD of trying to discover the right reports
    • Agile dimensional modeling creates flexible, report-neutral designs
      • Avoid decoding detailed business events from the current report definitions
    • Agile dimensional modeling enables proactive DW/BI to influence new operational systems
      • Proactive agile dimensional modeling enables BI stakeholders to define new business processes from a measurement perspective and provide timely BI input to operational application development or package configuration.
    • Evolutionary modeling supports accretive BI requirements
      • By following hands-on BI prototyping and/or real BI usage,  iterative and  incremental dimensional modeling allows stakeholders to (re)define their real data requirements.
    • Collaborative modeling teaches stakeholders to think dimensionally
      • Stakeholders involved in collaborative modeling have opportunities to think dimensionally about their data and define common, conformed dimensions and fact from the outset
    • Collaborative modeling creates stakeholder pride in the data warehouse

DIMENSIONAL DESIGN PATTERNS: the 7Ws Framework

  • Who is involved?
    What did they do? To what is it done?
    When did it happen?
    Where did it take place?
    Why did it happen?
    HoW did it happen – in what manner?
    HoW many or much was recorded – how can it be measured? Represent FACTs
  • – Organize design patterns around the 7W dimensional types discovered. By using the 7Ws to examine the complexities of modeling customers and employees (who), productions and services (what), time (when), location (where), business measures (how many), cause (why), and effect (how)
  • –  Fact tables represent VERBS (they record business process activity). The facts they contains and the dimensions that surround them are NOUNs, each classifiable as one of the 7Ws. 6Ws (WHO, WHAT, WHERE, WHEN, WHY (cause) AND HOW (effect)) represent dimension types. HOW MANY/MUCH represents facts
  • – Star schemas usually contain 8-12 dimensions because of the 6Ws and each can appear multiple times.
  • The benefit of process-oriented dimensional modeling  is  that  it naturally breaks data warehouse scope, design and development into manageable chunks consisting of just the individual business processes that need to be measured next.
  • Each business process as a separate star schema supports incremental design, development and usage. Agile dimensional modelers and BI stakeholders can concentrate on one business process  at  a  time
  •  Agile development teams can build and incrementally deliver individual star schemas earlier than monolithic designs. Agile BI users can gain early value by analyzing  these business processes  initially  in  isolation and  then grow into more valuable, sophisticated cross-process analysis.

Who and What: People and Organizations, Products and Services

  • – exclusive attributes
  • – swappable dimensions
  • – recursive hierarchies

When and Where: Time and Location

  • – separate calendar dimensions and clock dimensions
  • – Year-to-date (YTD) analysis: using fact state tables and fact-specific calendars
  • – Multinational calendars: by time and location
  • – modeling movement: events with additional time and location dimensions

How Many/Much: Facts and Measures and KPIs

  • – Design fact tables for performance and ease of use with 3 BASIC DESIGN PATTERNS: transactions, periodic snapshots, and accumulating snapshots. Using event timelines to model accumulating snapshots as evolving events
    • Factless fact tables: not contain any ACTUAL measures, only contain dimension FKs, where to do COUNT, GROUP BY the desired dimension attributes
    • Transaction fact tables: capture the details of individual state changes/each event (HOW MANY)
    • Accumulating (snapshot) fact tables – HOW efficiencies – DAILY process
      • Standard Accumulating snapshot fact tables: update (measured events: date, location, status, etc.) when the state of the pipeline changed. The grain is each item measured. The accumulating snapshot will have numerous date keys, such as date released by assembly, date shipped to distribution center, date received at distribution center, and so on until date of final sale. Fact table metrics will include a series of date lags that measure the time it took for a vehicle to move between pipeline steps .
        • Example: How quickly do vehicles move through the pipeline? What’s the average length of time from assembly release to final customer sale?
      • Timespan accumulating snapshot fact tables: add a new row (instead of updating) when something about the pipeline changed
    • Periodic snapshot fact tables: summarize events occurring over a period (day, week, month…). The grain is the period. The table often contain many facts (value) because any measurement event consistent with this table grain is permissible. These fact tables are uniformly dense in their foreign key as even if no activity happens during the period, a row is typically inserted into the fact table containing a zero/NULL for each fact (value) – HOW MUCH
  • – Providing the basis for flexible measures and KPIs: Defining atomic-level additive facts. Documenting semi-additive and non-additive facts, and understanding their limitations.
  • – Fact table performance optimization: Using indexing, partitioning, and aggregation to improve fact table ETL and query performance
  • – Cross-process analysis: Combining the results from multiple fact tables using drill-across processing and multi-pass queries. Building derived fact tables and consolidated data marts to simplify query processing

Why and How: Cause and Effect

  • Degenerate dimension: is a dimension key/field (NOT dimension data) in the fact table that does not have its own dimension table.
  • – Modeling causal factors: Using promotions, weather, and other causal dimensions to explain WHY events occur and WHY FACTS vary. Using TEXT dimensions to handle unstructured reasons and exception descriptions.
  • – Modeling event descriptions: using HOW dimensions to collect any additional descriptions of an event. Consolidating excessive DEGENERATE dimensions as HOW dimensions, and combining small WHY and HOW dimensions
  • – Multi-valued dimensions: Using BRIDGE tables and weighting factors to handle fact allocation (splitting the atom) when dimensions have multiple values for each atomic-level fact. Using optional BRIDGE tables and multi-level dimensions to efficiently handle barely multi-valued dimensions. Using PIVOTED dimensions to support complex multi-valued constraints
  • – Providing additional HOW dimensions: Using STEP dimensions for understanding sequential behavior, audit dimensions for tracking data quality/lineage, and range band dimensions for treating facts as dimensions.

Data Warehouse Analysis and Design

The two approaches commonly used to obtain data warehousing requirements are data-driven analysis (also known as supply driven) and reporting-driven analysis (also known as demand driven). While most modern data warehousing initiatives use some combination of the two.

DWAnalysis

  • Data-Driven Analysis
    • Pure data-driven analysis is often influenced by the OLTP perspective of the source data, especially when the operational systems are packaged applications, such as ERP systems built on highly GENERIC data models.
    • Pure data-driven analysis avoided early user involvement, which leads to DW designs that did not met BI user need.
    • Without user input to prioritize data requirements and set a manageable scope, these early data warehouse designs were time-consuming and expensive to build.
  • Reporting-Driven Analysis
    • Reporting requirements are gathered by interviewing potential BI users  in small groups
    • Reporting-driven analysis focuses the data warehouse design on efficiently prioritizing the stakeholder’s most urgent reporting requirements and can lead to timely,  successful deployments when the scope is managed carefully.
    • Unfortunately, to gather ALL the reporting requirements, the interview requires to attain an enterprise or even a cross-department perspective so it is very time-consuming.
    • Getting stakeholders to think beyond “the next set of reports” and describe longer term requirements in sufficient detail takes CONSIDERABLE interviewing skills.
    • Focusing too CLOSELY on current reports alone leads to inflexible dimensional models/fail to handle changes in the long term because the designs match the current report request too STRICTLY rather than treating these reports as clues to discovering the underlying business processes. The designs might be suitable for satisfying the current reporting needs or individual DEPARTMENTS but unsuitable for Enterprise Data Warehouse design (for atomic-level detailed business processes rather than summary report requests).
  • Proactive DW/BI Analysis and Design
    • Benefits of Proactive Design for Data Warehousing: can easily be incorporated with OLTP data/systems
    • Challenges of Proactive Analysis for Data Warehousing: Proactive analysis takes place before any real data (on sources) exists; rely even less upon traditional analysis techniques to provide BI data requirements
    • Proactive Reporting-Driven Analysis Challenges
      • Traditional  interviewing  techniques  for  gathering  reporting  requirements  are problematic when stakeholders haven’t seen the data or applications that will fuel their BI  imagination.
    • Proactive Data-Driven Analysis Challenges
      • This approach is not efficient if only a small percentage of data/even no data is populated and well understood (require too long to have large data set or understand this data)
    • Data then Requirements: a ‘Chicken or the egg’ Conundrum
      • proactive data warehousing needs a new approach to database analysis and design: not your father’s data modeling, not even your father’s dimensional modeling

Agile Dimensional Modeling and Traditional DW/BI Analysis

Agile dimensional modeling doesn’t completely replace traditional DW/BI analysis tasks, but by preceding both data-driven and reporting-driven analysis it can make them  agile  too:  significantly  reducing  the work  involved while  improving  the quality and value of the results.

  • Agile Data-Driven Analysis
    • streamlined by targeted data profiling. Only the data sources implicated by the agile data model need to be analyzed within each iteration
  • Agile Reporting-Driven Analysis
    • take the form of BI prototyping. The early delivery of dimensional database schemas enables the early extraction, transformations and loading (ETL) of real sample data so that better report requirements can be prototyped using the BI user’s actual BI toolset rather than mocked-up with spreadsheets

Requirements for Agile Dimensional Modeling

Agile modeling requires both IT and business stakeholders to change their work practices and adopt new tools and techniques:

  • Collaborative modelers require techniques that encourage interaction: simple frameworks, checklists, guidelines
  • Collaborative data modeling must use simple, inclusive notation and tools : ER diagrams are too complex or too abstract to Business stakeholders => agile modelers need to create less abstract, more inclusive data models using simple tools that are easy to use and easy to share.These inclusive models must easily translate into the more technically detailed, logical and physical, star schemas used by DBAs and ETL/BI developers
  • Data modeling sessions (model-storms) need to be quick: hours rather than days: it is hard to schedule long meetings with all stakeholders and modelers there are many modeling sessions
  • Agile modelers must BALANCE JIT and JEDUF modeling to reduce design rework: JIT modeling for the current data requirement and JEDUF modeling for the big picture to help anticipate future iterations and reduce the amount of design rework.
  • Evolutionary DW development benefits from ETL/BI tools that support automated testing: Developers need to embrace DB changes to match users’ real requirements. To respond efficiently to evolutionary data warehouse design, agile ETL and BI developers need tools that support database impact analysis and automated testing.
  • DW designers must also embrace change and allow their models to evolve: refactor the DB design but must avoide to resorting the generic data model PATTERNS which support understandability and query performance. Agile data warehouse modelers need trust dimensional design patterns.
  • There are proven tools and techniques existed to address the majority of these agile modeling prerequisites: interactive,  inclusive,  quick,  supports  JIT  and JEDUF, and enables DW teams to embrace change

BEAM * table

BEAM * short codes: short (alphanumeric) codes (mainly); 2 letter abbreviation of data properties to the table to

  • Document dimensional attribute properties including history rules
  • Document fact properties including aggregation rules
  • Record data-profiling results and map data sources to requirements
  • Define physical dimensional models: fact and dimension tables
  • Generate star schemas

Data Model Types

Work with different types of models depending on the level of technical details captured or communicated for the audience/collaborators.

  • Conceptual data models (CDM): contain the least technical detail and are intended for exploring data requirements with non-technical stakeholders.
  • Logical data model (LDM): record more technical details without going down to the data base specific level
  • Physical data models (PDM) are used by DBAs to create database schemas for a specific DBMS
Advertisements
One Comment
  1. pete permalink

    Hi Elly,
    nice blog on agile in EDW!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s

%d bloggers like this: