GettingStarted - With Data Quality Guide
Version 8.5
August 2007
This software and documentation contain proprietary information of Informatica Corporation, and are provided under a license agreement containing
restrictions on use and disclosure and are also protected by copyright law. Reverse engineering of the software is prohibited. No part of this document may be
reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without the prior written consent of Informatica
Corporation.
Use, duplication, or disclosure of the Software by the U.S. Government is subject to the restrictions set forth in the applicable software license agreement and as
provided in DFARS 227.7202-1(a) and 227.7702-3(a) (1995), DFARS 252.227-7013(c)(1)(ii) (OCT 1988), FAR 12.212(a) (1995), FAR 52.227-19, or FAR
52.227-14 (ALT III), as applicable.
Informatica, PowerCenter, PowerCenterRT, PowerExchange, PowerCenter Connect, PowerCenter Data Analyzer, PowerMart, Metadata Manager, Informatica
Data Quality and Informatica Data Explorer are trademarks or registered trademarks of Informatica Corporation in the United States and in jurisdictions
throughout the world. All other company and product names may be trade names or trademarks of their respective owners. U.S. Patent Pending.
Portions of this software and/or documentation are subject to copyright held by third parties, including without limitation: Copyright © Sun Microsystems. All
rights reserved. Copyright © Platon Data Technology GmbH. All rights reserved. Copyright © Melissa Data Corporation. All rights reserved. Copyright ©
1995-2006 MySQL AB. All rights reserved
This product includes software developed by the Apache Software Foundation (http://www.apache.org/). The Apache Software is Copyright © 1999-2006 The
Apache Software Foundation. All rights reserved.
ICU is Copyright (c) 1995-2003 International Business Machines Corporation and others. All rights reserved. Permission is hereby granted, free of charge, to
any person obtaining a copy of the ICU software and associated documentation files (the “Software”), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish, distribute, and/or sell copies of the Software, and to permit persons to whom the Software is
furnished to do so.
ACE(TM) and TAO(TM) are copyrighted by Douglas C. Schmidt and his research group at Washington University, University of California, Irvine, and
Vanderbilt University, Copyright (c) 1993-2006, all rights reserved.
Tcl is copyrighted by the Regents of the University of California, Sun Microsystems, Inc., Scriptics Corporation and other parties. The authors hereby grant
permission to use, copy, modify, distribute, and license this software and its documentation for any purpose.
InstallAnywhere is Copyright © Macrovision (Copyright ©2005 Zero G Software, Inc.) All Rights Reserved.
Portions of this software use the Swede product developed by Seaview Software (www.seaviewsoft.com).
This product includes software developed by the JDOM Project (http://www.jdom.org/). Copyright © 2000-2004 Jason Hunter and Brett McLaughlin. All
rights reserved.
This product includes software developed by the JFreeChart project (http://www.jfree.org/freechart/). Your right to use such materials is set forth in the GNU
Lesser General Public License Agreement, which may be found at http://www.gnu.org/copyleft/lgpl.html. These materials are provided free of charge by
Informatica, “as is”, without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness
for a particular purpose.
This product includes software developed by the JDIC project (https://jdic.dev.java.net/). Your right to use such materials is set forth in the GNU Lesser General
Public License Agreement, which may be found at http://www.gnu.org/copyleft/lgpl.html. These materials are provided free of charge by Informatica, “as is”,
without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular
purpose.
This product includes software developed by l2fprod.com (http://common.l2fprod.com/). Your right to use such materials is set forth in the Apache License
Agreement, which may be found at http://www.apache.org/licenses/LICENSE-2.0.html.
DISCLAIMER: Informatica Corporation provides this documentation "as is" without warranty of any kind, either express or implied, including, but not limited
to, the implied warranties of non-infringement, merchantability, or use for a particular purpose. The information provided in this documentation may include
technical inaccuracies or typographical errors. Informatica could make improvements and/or changes in the products described in this documentation at any
time without notice.
Table of Contents
List of Figures
Preface
About This Book
Document Conventions
Other Informatica Resources
Visiting Informatica Customer Portal
Visiting the Informatica Web Site
Visiting the Informatica Knowledge Base
Obtaining Customer Support
Plan 01: Profile Demo Data
Plan 02: Pre-Standardization Scorecard
Plan 03: Standardize Generic Data
Plan 04: Standardize Name Data
Plan 05: Standardize Address Data
Plan 06: Match Demo Data
Plan 07: Consolidate Data
Plan 08: View Consolidated Data
Index
List of Figures
Figure 2-1. Informatica Data Quality Workbench User Interface
Figure 2-2. Creating a New Project in Data Quality Workbench
Figure 2-3. Character Labeller configuration dialog box, Parameters tab
Figure 2-4. Edit Distance, Parameters tab (Null Settings)
Figure 2-5. Dictionary Manager and business_word Dictionary sample
Figure 3-1. Sample Plan List in Workbench User Interface
Preface
Welcome to Informatica Data Quality, the latest-generation data quality management system
from Informatica Corporation. Informatica Data Quality will empower your organization to
solve its data quality problems and realize real, sustainable data quality improvements.
This guide will enable you to understand the principles of data quality and the functionality
of Informatica Data Quality applications and plans. Its intended audience includes third-
party software developers and systems administrators who are installing Data Quality within
their IT infrastructure, business users who wish to use Data Quality and learn more about its
operations, and PowerCenter users who may access Data Quality tools when working with the
Informatica Data Quality Integration plug-in.
About This Book
The material in this guide is available in printed form from Informatica Corporation.
Document Conventions
This guide uses the following formatting conventions:
italicized monospaced text    This is the variable name for a value you enter as part of an
                              operating system command. This is generic text that should be
                              replaced with user-supplied values.
Warning:                      The following paragraph notes situations where you can overwrite
                              or corrupt data, unless you follow the specified procedure.
bold monospaced text          This is an operating system command you enter from a prompt to
                              run a task.
Other Informatica Resources
In addition to the product manuals, Informatica provides these other resources:
♦ Informatica Customer Portal
♦ Informatica web site
♦ Informatica Knowledge Base
♦ Informatica Global Customer Support
Use the following telephone numbers to contact Informatica Global Customer Support:
North America / South America Europe / Middle East / Africa Asia / Australia
Chapter 1
Informatica Data Quality Product Suite
Welcome to Informatica Data Quality 8.5.
Informatica Data Quality is a suite of applications and components that you can integrate
with Informatica PowerCenter to deliver enterprise-strength data quality capability in a wide
range of scenarios.
The core components are:
♦ Data Quality Workbench. Use to design, test, and deploy data quality processes, called
plans. Workbench allows you to test and execute plans as needed, enabling rapid data
investigation and testing of data quality methodologies. You can also deploy plans, as well
as associated data and reference files, to other Data Quality machines. Plans are stored in a
Data Quality repository.
Workbench provides access to fifty database-based, file-based, and algorithmic data quality
components that you can use to build plans.
♦ Data Quality Server. Use to enable plan and file sharing and to run plans in a networked
environment. Data Quality Server supports networking through service domains and
communicates with Workbench over TCP/IP. Data Quality Server allows multiple users to
collaborate on data projects, speeding up the development and implementation of data
quality solutions.
You can install the following components alongside Workbench and Server.
♦ Integration Plug-In. Informatica plug-in enabling PowerCenter to run data quality plans
for standardization, cleansing, and matching operations. The Integration plug-in is
included in the Informatica Data Quality install fileset.
♦ Free Reference Data. Text-based dictionaries of common business and customer terms.
♦ Subscription-Based Reference Data. Databases, sourced from third parties, of deliverable
postal addresses in a country or region.
♦ Pre-Built Data Quality Plans. Data quality plans built by Informatica to perform out-of-
the-box cleansing, standardization, and matching operations. Informatica provides free
demonstration plans. You can purchase pre-built plans for commercial use.
♦ Association Plug-In. Informatica plug-in enabling PowerCenter to identify matching data
records from multiple Integration transformations and associate these records together for
data consolidation purposes.
♦ Consolidation Plug-In. Informatica plug-in enabling PowerCenter to compare the linked
records sent as output from an Association transformation and to create a single master
record from these records.
For information on installing and configuring these components, see the Informatica Data
Quality Installation Guide.
Overview
This chapter will help you learn more about Data Quality Workbench and the operations of
data quality plans.
Although the Workbench user interface is straightforward to use, the data quality processes,
or plans, built in Workbench can be as simple or complex as you decide. Also, a key
characteristic of a data quality plan is the inter-dependency of the configured elements within
it.
Bear in mind that the goal of a data quality plan is not always to achieve the highest data
quality readings — that is, to find zero duplicates, or deliver a 100% data accuracy score. At
the end of a data quality project, these may be realistic objectives. However, the goals of most
analysis plans are twofold: to achieve the most faithful representation of the state of the
dataset, and to highlight the outstanding data quality characteristics of interest to the
business.
While a plan designer may tune a plan so that it captures the data quality characteristics more
accurately, it is also possible to tune a plan’s settings so that the data quality results appear to
improve beyond the reality of the data. The ability to tune a plan properly comes with
training and experience in Data Quality Workbench.
Note: The full range of functionality available in Data Quality Workbench is described in the
Informatica Data Quality User Guide.
The Workbench user interface shows tabs to the left for the Project Manager and File
Manager and a workspace to the right in which you’ll design plans. The fifty data components
that you can use in a plan are shown on a dockable panel on the right-hand side.
There are three basic types of component:
♦ Data Sources, which define the data inputs for the plan. Sources can connect to files or
database tables.
♦ Operational Components, which perform data analysis or data transformation
operations on the data.
Many operational components make use of reference dictionaries when analyzing or
transforming data; dictionaries are explained below.
♦ Data Sinks, which define the data outputs that will be written to file or to the database.
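The relationship between the three component types can be pictured as a small pipeline. The following is a minimal Python sketch for illustration only, not Workbench code: the names run_plan and trim_fields are hypothetical, and in-memory lists stand in for the files or database tables that a real source or sink would connect to.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]


def run_plan(source: Iterable[Record],
             operations: List[Callable[[Record], Record]],
             sink: List[Record]) -> None:
    """Read each record from the source, apply every operation in order,
    and append the result to the sink."""
    for record in source:
        for operation in operations:
            record = operation(record)
        sink.append(record)


def trim_fields(record: Record) -> Record:
    """Hypothetical operational component: strip surrounding whitespace."""
    return {name: value.strip() for name, value in record.items()}


rows = [{"Company Name": "  Acme Corp "}]   # stands in for a data source
output: List[Record] = []                    # stands in for a data sink
run_plan(rows, [trim_fields], output)
```

The order of the operations list matters in this sketch, which loosely mirrors the inter-dependency of configured elements within a plan.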
You can rename or delete the four folders created by default beneath this project.
Copy or import one or more plans to this project. You can make a copy of a plan within the
Project Manager by highlighting the plan name and typing Ctrl+C. You can then paste the
plan to the project you have just created by right-clicking the project name and selecting Paste
from the context menu.
You can also import a plan file in PLN or XML format from the file system. This may
be appropriate if you have received pre-built plans from Informatica, in which case backup
copies of the plans may be installed to your file system.
♦ For more information on importing plans, see the Informatica Data Quality User Guide.
Rename the project and imported plan(s). You should give names to your new projects and
plans that clearly distinguish them from the installed data quality project.
1. Click the Realtime Source in the plan workspace and press Delete.
2. Click a CSV Source in the component palette and add it to the workspace. If the plan
already contains a CSV Source or a CSV Match Source, skip this step.
Tip: Place the CSV Source in the workspace so that it is above or to the left of the other
components in the plan.
3. Right-click the CSV Source icon and select Configure from the context menu that
appears.
4. When the source configuration dialog box opens, click the Select button and navigate to
a delimited file.
5. Review the other options in the configuration dialog: confirm the field delimiter for the
file values, and indicate if the first line of the file contains header information.
Note: Make sure that the Enable Realtime option is cleared.
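The two options reviewed in step 5, the field delimiter and the header flag, determine how each line of the file is split and how the columns are named. The following minimal Python sketch shows the effect of each option; it is not how the CSV Source is implemented. The function read_delimited, the Field1, Field2 naming scheme, and the sample values are assumptions made for illustration.

```python
import csv
import io
from typing import Dict, List


def read_delimited(text: str, delimiter: str = ",",
                   has_header: bool = True) -> List[Dict[str, str]]:
    """Parse delimited text into records keyed by column name.

    When has_header is False, columns are named Field1, Field2, ...
    (a naming scheme assumed here for illustration).
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return []
    if has_header:
        header, data = rows[0], rows[1:]
    else:
        header = [f"Field{i + 1}" for i in range(len(rows[0]))]
        data = rows
    return [dict(zip(header, row)) for row in data]


# Hypothetical pipe-delimited sample with a header line.
sample = "Customer Number|Zipcode\n1591001|10001\n"
records = read_delimited(sample, delimiter="|", has_header=True)
```

If the header flag is set incorrectly, the first line of the file is treated as data, which is why the option is worth confirming before running the plan.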
Note: To run a plan, click the Run Plan button on the Workbench toolbar. For more
information on running plans, see “Running Plans: Local and Remote Execution” on page 8.
The settings in this dialog box illustrate the implications of changing component settings
within a plan.
The Standard Symbols group box provides settings that determine the types of character that
will be labelled by this component and how they will be labelled. For example, you can pass a
telephone number record through this component and define a new output field that will
map its characters by type, such that the number 555-555-1000 generates a new value nnn-
nnn-nnnn.
The manner in which a number like this, or non-numeric data, is labelled depends on the
symbol settings. For example, you can check the Symbol field and enter a label for non-
alphanumeric values. The number 555-555-1000 may then appear as nnnxnnnxnnnn,
depending on the symbol value you provide here.
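The labelling behaviour described above can be sketched in a few lines of Python. This is an illustrative approximation, not the Character Labeller's implementation: the function name label_characters is hypothetical, and the use of a as the label for letters is an assumption (the text above only shows n for digits).

```python
def label_characters(value: str, symbol: str = "") -> str:
    """Map digits to 'n' and letters to 'a'.

    Other characters are copied through unchanged unless a symbol label
    is supplied, in which case each is replaced by that label.
    """
    labels = []
    for ch in value:
        if ch.isdigit():
            labels.append("n")
        elif ch.isalpha():
            labels.append("a")          # assumed label for letters
        else:
            labels.append(symbol if symbol else ch)
    return "".join(labels)


print(label_characters("555-555-1000"))        # nnn-nnn-nnnn
print(label_characters("555-555-1000", "x"))   # nnnxnnnxnnnn
```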
If your plan is concerned simply with labelling the characters by type, then changing the
symbol value to x will not meaningfully affect the results of the plan. However, if the
component output is read by another component — or the plan outputs read by another plan
— then such a change may have a serious impact.
OR Input1 = nnnnn-nnnn
(Output in this example refers to the original data fields that were labelled by the Character
Labeller.)
If the Character Labeller configuration was changed so that symbols were written as any
character other than a hyphen — e.g. nnnnnsnnnn — and the associated rule was not
changed, then the plan will not produce meaningful results.
The Rule can be changed by adding a line to this effect, i.e.
IF Input1 = nnnnn
OR Input1 = nnnnn-nnnn
OR Input1 = nnnnnsnnnn
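The effect of adding the extra line can be sketched as a set-membership test. The Python below is a hedged illustration of the rule's condition only; the action the rule takes is not shown in this section, so the sketch simply returns True or False. The names ORIGINAL_PATTERNS, EXTENDED_PATTERNS, and rule_matches are hypothetical.

```python
ORIGINAL_PATTERNS = {"nnnnn", "nnnnn-nnnn"}
EXTENDED_PATTERNS = ORIGINAL_PATTERNS | {"nnnnnsnnnn"}


def rule_matches(label: str, patterns: set) -> bool:
    """Return True when the labelled value equals one of the patterns,
    mirroring the IF ... OR ... condition."""
    return label in patterns


# A value labelled with 's' for symbols fails the original condition
# but satisfies the extended one.
print(rule_matches("nnnnnsnnnn", ORIGINAL_PATTERNS))  # False
print(rule_matches("nnnnnsnnnn", EXTENDED_PATTERNS))  # True
```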
Edit Distance
For example, the Edit Distance component derives a match score for two data strings by
calculating the minimum “cost” of transforming one string into another by inserting,
deleting, or replacing characters. The dissimilarity between the strings Washington and
Washingtin can be remedied by editing a single character, so the match score is 0.9 (the score
is penalized by ten per cent as one of the ten characters must be edited).
When one (or each) input string is null, the component applies a set score that can be tuned
by the user. The default scores are 0.5. You can change these scores to reflect the severity of
the presence of null fields in the selected data.
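The scoring described above can be sketched in Python using the standard Levenshtein distance. This is a minimal illustration under stated assumptions, not the component's actual algorithm: dividing the distance by the length of the longer string is an assumption that reproduces the 0.9 figure, empty strings are treated as null, and the names levenshtein, match_score, one_null_score, and both_null_score are hypothetical.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    replacements needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # replace or keep
        previous = current
    return previous[-1]


def match_score(a: Optional[str], b: Optional[str],
                one_null_score: float = 0.5,
                both_null_score: float = 0.5) -> float:
    """Score in [0, 1]; the null scores apply when inputs are missing."""
    if not a and not b:
        return both_null_score
    if not a or not b:
        return one_null_score
    longest = max(len(a), len(b))
    return 1.0 - levenshtein(a, b) / longest


print(match_score("Washington", "Washingtin"))  # 0.9
print(match_score("Washington", None))          # 0.5
```

Lowering one_null_score in this sketch would make records with a missing field less likely to be treated as matches, which is the kind of tuning the null settings allow.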
Figure 2-4. Edit Distance, Parameters tab (Null Settings)
Third-party dictionaries originate from postal address database companies that provide
authoritative address validation capability to Data Quality’s validation components. They are
Overview
This chapter demonstrates plan building in action by following a simple data quality project
from start to finish. The project operations take place in Data Quality Workbench. They do
not involve any Workbench-Server functionality or any interaction with PowerCenter.
IDQDemo Plans
The Data Quality install process installs a project named IDQDemo to the Data Quality
repository and also writes a copy of the plan files to the Informatica Data Quality folder
structure. The high-level objective of this sample project is to profile and cleanse a dataset of
business-to-business records.
Figure 3-1 shows the layout of the installed plans in Workbench.
The project analyzes, cleanses, and standardizes a United States business-to-business dataset of
approximately 1,200 records. This dataset is installed in the IDQDemo\Sources folder with
the filename IDQDemo.csv. The dataset comprises the following columns:
♦ Customer Number
♦ Contact Name
♦ Company Name
♦ Address 1
♦ Address 2
♦ Address 3
♦ Address 4
♦ Zipcode
♦ ISO Country Code
♦ Currency
♦ Customer Turnover
Plan 01: Profile Demo Data
This plan analyzes the IDQDemo.csv source data and generates an Informatica Report file as
output.
The components have been configured to assess data quality in the following ways:
♦ The Merge component has been configured to merge four fields from IDQDemo.csv
(Address1—Address4) into a single column named Merged Address. It also merges the
ISO Country Code and Currency columns into a Merged CountryCode and Currency
column that will be analyzed by the Token Labeller.
♦ The Merged Address column is used as input by the Context Parser, which applies a
reference dictionary of city and town names to the merged data and writes the output to a
new column named CP City or Town. Any American city names found in Merged Address
are written to this new column.
♦ The Rule Based Analyzer has been configured to apply business rules to the
Customer_Number and CP City or Town fields.
The Test Completeness of Cust_No rule comprises a simple IF statement. If a value is
present in a Customer_Number field, the rule writes Complete Customer_Number to a
corresponding field in a new column named Customer Number Completeness. If not, the
rule writes Incomplete Customer_Number in the relevant field.
The Test Completeness of Town rule profiles the completeness of the CP City or Town
column through a similar IF statement. An empty field in the CP City or Town column
indicates that the underlying address data lacks recognizable city/town information.
Conversely, any name in a CP City or Town field has already been verified by the Context
Parser (see above); the rule writes such names to a new column named City_or_Town
Completeness.
♦ The Character Labeller analyzes the conformity of the Customer_Number field. (Brief
analysis of IDQDemo.csv indicates that all customer account numbers begin with the
digits 159, 191, or 101: type F6 to open the Source Viewer and examine a subset of the
data.)
♦ On the Filters tab of the component’s configuration dialog box, filters have been defined
to identify the account numbers that begin with 159, 191, and 101. All numbers so
identified are written to a new column named Customer Number Conformity as specified
on the component’s Outputs tab.
♦ The Token Labeller applies reference dictionaries to analyze conformity in the Contact
Name, Company Name, Zipcode, Currency, ISO Country Code, and Merged
CountryCode and Currency columns.
Contact and company name data are analyzed against dictionaries of name prefixes, first
names, and surnames, and against dictionaries of US company names. Similarly, zip code
and currency data are analyzed against dictionaries of valid zip codes and currency names
respectively.
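Three of the checks listed above (merging the address fields, testing Customer Number completeness, and filtering on the account number prefix) can be sketched in Python as follows. This is an illustrative approximation rather than the plan's actual logic: the function profile_record, the space separator used when merging, the dictionary keys, and the sample row are all assumptions.

```python
from typing import Dict

ADDRESS_FIELDS = ["Address 1", "Address 2", "Address 3", "Address 4"]
VALID_PREFIXES = ("159", "191", "101")


def profile_record(record: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the record with three derived columns added."""
    result = dict(record)

    # Merge: join the non-empty address fields (space separator assumed).
    parts = [record.get(f, "").strip() for f in ADDRESS_FIELDS]
    result["Merged Address"] = " ".join(p for p in parts if p)

    # Completeness rule: a simple IF on whether a value is present.
    number = record.get("Customer Number", "").strip()
    result["Customer Number Completeness"] = (
        "Complete Customer_Number" if number else "Incomplete Customer_Number"
    )

    # Conformity filter: keep the number only if it starts with a known prefix.
    result["Customer Number Conformity"] = (
        number if number.startswith(VALID_PREFIXES) else ""
    )
    return result


# Hypothetical sample row; not taken from IDQDemo.csv.
row = {"Customer Number": "1591001", "Address 1": "1 Main St",
       "Address 2": "", "Address 3": "Springfield", "Address 4": ""}
profiled = profile_record(row)
```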
Output1 := "Ms"
ELSE
Output1 := Input1
ENDIF
Index

D
Data Quality Plans 11
Data Quality Repository
    copying plans in 13
    described 3
Dictionaries 19

I
IDQDemo Project 21
    Plan 01 Profile Demo Data 24
    Plan 02 Pre-Standardization Scorecard 26
    Plan 03 Standardize Generic Data 27
    Plan 04 Standardize Name Data 30
    Plan 05 Standardize Address Data 32
    Plan 06 Match Demo Data 34
    Plan 07 Consolidate Data 35
    Plan 08 View Consolidated Data 36
Informatica Data Quality
    Integration Plug-In 2
    Server 2, 3
    Workbench 2, 3

P
Plan Components
    Character Labeller example 16
    configuring 15
    data sinks 11
    data source types 15
    data sources 11
    Edit Distance example 17
    operational components 11
    Rule Based Analyzer example 16
    Weight Based Analyzer example 18
Plans 5
PowerCenter Data Cleanse and Match 3

U
User Interface 11