Sunday, 13 January 2013

Introduction to Data Quality Services

Working as a SQL BI specialist, I need to keep myself up to date with the latest advancements in SQL Server technology. In that process I recently learnt about a new feature of SQL Server 2012, which I would like to share in this series of articles.

Capturing and managing data is crucial to any business. Data shows how a business is performing and provides pointers to the decisions that need to be taken for business development. The Microsoft BI stack provides various tools to capture and manage data. Business data originates from various sources, so there is always a possibility of incorrect data coming in. It is important to capture valid data; hence many transformations need to be applied to validate and correct the data.

SQL Server 2012 provides a new feature called Data Quality Services (DQS), which helps ensure that data is valid. DQS is a knowledge-driven data quality product: it allows you to build knowledge bases that handle traditional data quality tasks such as profiling, correction, enrichment, standardization and de-duplication.

DQS Architecture
The two primary components of the DQS architecture are Data Quality Server and Data Quality Client. These components enable you to perform data quality operations separately from other SQL Server operations. Both are installed from within the SQL Server setup program.


Data Quality Server is implemented as three SQL Server catalogs that you can manage and monitor in SQL Server Management Studio: DQS_MAIN, DQS_PROJECTS, and DQS_STAGING_DATA.

  • DQS_MAIN includes DQS stored procedures, the DQS engine, and published knowledge bases. 
  • DQS_PROJECTS includes data that is required for knowledge base management and DQS project activities. 
  • DQS_STAGING_DATA provides an intermediate staging database where you can copy your source data to perform DQS operations, and then export your processed data.
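If you want to verify that Data Quality Server is installed on an instance, one quick way (assuming the default catalog names listed above) is to query the `sys.databases` catalog view from SSMS:

```sql
-- List the three DQS catalogs created by the Data Quality Server installer
SELECT name, create_date
FROM sys.databases
WHERE name IN ('DQS_MAIN', 'DQS_PROJECTS', 'DQS_STAGING_DATA');
```

If all three rows come back, the Data Quality Server databases are in place on that instance.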

Data Quality Client is a standalone application that enables you to perform knowledge management, data quality projects, and administration in a single user interface. The application is designed for both data stewards and DQS administrators, and it supports knowledge discovery, domain management, matching policy creation, data cleansing, matching, profiling, monitoring, and server administration. Data Quality Client can be installed and run on the same computer as Data Quality Server or remotely on a separate computer. Many operations in Data Quality Client are wizard-driven for ease of use.

The DQS architecture also includes components that are built into other SQL Server 2012 features. For example, Integration Services has the DQS Cleansing transformation you use to apply data-cleansing rules to a data flow pipeline. In addition, Master Data Services supports DQS matching so that you can de-duplicate data before adding it as master data.

Because DQS is knowledge-driven, you first have to create a knowledge base (KB) that is relevant to the data sources you would like to cleanse. You build this KB by acquiring knowledge from samples of your data (a process called "knowledge discovery"), by configuring the KB to use external knowledge from the Windows Azure Data Market or third-party reference data services, or by manually adding knowledge to the KB. The knowledge is stored in the context of data entities called data domains; examples of domains include City, State and Email Address. The knowledge in a domain consists of valid values, invalid values, relations between values, validation rules, and so on. Knowledge bases are created and managed through the Data Quality Client.
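To make the idea of a domain concrete, here is a small conceptual sketch in Python (not DQS itself, and not its API): a domain holds known-valid values plus correction rules, and cleansing classifies each incoming value the way DQS reports values as Correct, Corrected or Invalid. The `Domain` class and the sample City values are illustrative assumptions.

```python
class Domain:
    """Conceptual stand-in for a DQS data domain."""

    def __init__(self, name, valid_values, corrections=None):
        self.name = name
        self.valid_values = set(valid_values)
        # corrections: map a known-bad spelling to its correct value
        self.corrections = corrections or {}

    def cleanse(self, value):
        """Return a (status, output_value) pair for one input value."""
        if value in self.valid_values:
            return ("Correct", value)
        if value in self.corrections:
            return ("Corrected", self.corrections[value])
        return ("Invalid", value)

# Hypothetical City domain built from "discovered" knowledge
city = Domain(
    "City",
    valid_values={"New York", "Seattle"},
    corrections={"Seatle": "Seattle", "NYC": "New York"},
)

print(city.cleanse("Seattle"))   # ('Correct', 'Seattle')
print(city.cleanse("Seatle"))    # ('Corrected', 'Seattle')
print(city.cleanse("Sea-Town"))  # ('Invalid', 'Sea-Town')
```

In real DQS the valid values, synonyms and rules live inside the published KB, and the cleansing activity of a data quality project (or the SSIS DQS Cleansing transformation) applies them to your source data.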
