Searching...
Tuesday, 10 August 2010

Data warehousing concepts

Information Technology (IT) has historically influenced organizational performance and competitive standing. The increasing processing power and sophistication of analytical tools and techniques have put the strong foundation for the product called data warehouse.

Definition - A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process.

A data warehouse is a relational database that is designed for query and analysis rather than transaction processing. A data warehouse usually contains historical data that is derived from transaction data. It separates analysis workload from transaction workload and enables a business to consolidate data from several sources. In addition to a relational database, a data warehouse environment often consists of an ETL solution, staging area, Operational data store, metadata, an OLAP engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.



ETL Solution – Performs data extraction, integration, transformation and loading operations
Staging area – Temporary location for integrating and summarize data
ODS (Operational Data Store) – Holds recent transactional data before migration to the data warehouse.
Metadata – Holds information about data stored in data warehouse.
Enterprise Data Warehouse – Central data repository for decision support throughout the enterprise.
Data Mart – Subset of data warehouse and supports a particular region, business unit or business function.
OLAP Engine – Retrieve data from warehouse and process for analysis.
Other BI Tools – Application software to report, analyze and present data.

Data warehouses and data marts are built on dimensional data modeling which is different from relational data modeling used for transactional (OLTP) type systems.  In a dimensional data modeling database can be visualized as a cube of several dimensions. A dimension of the cube represents a category of information with attributes and hierarchy.

Ex – In the Time dimension Year is the unique level (attribute) with following hierarchy (specification of levels that represents relationship between different attributes within a dimension).

Year –> Month  –> Day

A dimensional model includes fact tables and lookup tables.

Fact Table – A fact table is a table that contains the measures of interest. For example, sales amount would be such a measure. This measure is stored in the fact table with the appropriate granularity. For example, it can be sales amount by store by day. In this case, the fact table would contain three columns: A date column, a store column, and a sales amount column.

There are three types of facts -
Additive - Additive facts are facts that can be summed up through all of the dimensions in the fact table.
Semi-Additive - Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
Non-Additive - Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.

Let us use examples to illustrate each of the three types of facts. The first example assumes that we are a retailer, and we have a fact table with the following columns -
Date
Store
Product
Sales_Amount

The purpose of this table is to record the sales amount for each product in each store on a daily basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact, because you can sum up this fact along any of the three dimensions present in the fact table — Date, Store, and Product. For example, the sum of Sales_Amount for all 7 days in a week represents the total sales amount for that week.


Say we are a bank with the following fact table -
Date
Account
Current_Balance
Profit_Margin

The purpose of this table is to record the current balance for each account at the end of each day, as well as the profit margin for each account for each day. Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-additive fact, as it makes sense to add them up for all accounts (what’s the total current balance for all accounts in the bank?), but it does not make sense to add them up through time (adding up all current balances for a given account for each day of the month does not give us any useful information). Profit_Margin is a non-additive fact, for it does not make sense to add them up for the account level or the day level.

Based on the above classifications, there are two types of fact tables -
Cumulative - This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts for this type of fact tables are mostly additive facts. The first example presented here is a cumulative fact table.
Snapshot - This type of fact table describes the state of things in a particular instance of time, and usually includes more semi-additive and non-additive facts. The second example presented here is a snapshot fact table.

Lookup Table (Dimension Table) – The lookup table provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse. Each row (each quarter) may have several fields, one for the unique ID that identifies the quarter, and one or more additional fields that specifies how that particular quarter is represented on a report (for example, first quarter of 2001 may be represented as “Q1 2001″ or “2001 Q1″).

To maintain historic record and identify each record uniquely dimension tables will have Surrogate keys. It is the system generated key which cannot be edited by the user. Surrogate keys are extremely useful when storing historical information.

For example Consider one has a table in which a person and his location are stored. Now when his location is changed and we want to keep a historical record of the same it is stored with a surrogate key that will help us to uniquely identify the record.

Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.

In designing data models for data warehouses / data marts, the most commonly used schema types are Star Schema and Snowflake Schema.

Star Schema – In the star schema design, a single object (the fact table) sits in the middle and is radially connected to other surrounding objects (dimension lookup tables) like a star. Each dimension is represented as a single table. The primary key in each dimension table is related to a foreign key in the fact table.

All measures in the fact table are related to all the dimensions that fact table is related to. In other words, they all have the same level of granularity.

A star schema can be simple or complex. A simple star consists of one fact table; a complex star can have more than one fact table.

Let’s look at an example - Assume our data warehouse keeps store sales data, and the different dimensions are time, store, product, and customer. In this case, the following figure represents our star schema.

There exists a primary key / foreign key relationship between the fact and dimension tables. Note that different dimensions are not related to one another.


Snowflake Schema – The snowflake schema is an extension of the star schema, where each point of the star explodes into more points. In a star schema, each dimension is represented by a single dimensional table, whereas in a snowflake schema, that dimensional table is normalized into multiple lookup tables, each representing a level in the dimensional hierarchy.

For example, the Time Dimension that consists of 2 different hierarchies:

1. Year → Month → Day
2. Week → Day

We will have 4 lookup tables in a snowflake schema: A lookup table for year, a lookup table for month, a lookup table for week, and a lookup table for day. Year is connected to Month, which is then connected to Day. Week is only connected to Day. A sample snowflake schema illustrating the above relationships in the Time Dimension is shown below.


The main advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and joining smaller lookup tables. The main disadvantage of the snowflake schema is the additional maintenance efforts needed due to the increase number of lookup tables.





0 comments:

Post a Comment