Course Guide - IBM Information Analyzer Essentials v11.5 (Course code KM803 ERC 2.0).pdf

English Pages 486 Year 2016

Author / Uploaded
IBM

Categories
Computers

Table of contents :
KM803_Preface......Page 0
01-Information_analysis_overview......Page 17
02-Information_Server_overview......Page 49
03-Information_Analyzer_overview......Page 69
04-Information_analyzer_setup......Page 93
05-Data_Classes......Page 147
06-Column_Analysis......Page 187
07-Data_profiling_techniques......Page 241
08-Table_analysis......Page 303
09-Cross_table_analysis......Page 335
10-Baseline_analysis......Page 363
11-Reporting_and_publishing_results......Page 381
12-Data_rules_and_metrics......Page 413

Course Guide

IBM Information Analyzer Essentials v11.5 Course code KM803 ERC 2.0

IBM Training

Preface

August, 2016

NOTICES

This information was developed for products and services offered in the USA. IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to:

IBM Director of Licensing
IBM Corporation
North Castle Drive, MD-NC119
Armonk, NY 10504-1785
United States of America

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

Any references in this information to non-IBM websites are provided for convenience only and do not in any manner serve as an endorsement of those websites. The materials at those websites are not part of the materials for this IBM product and use of those websites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

TRADEMARKS

IBM, the IBM logo, ibm.com and InfoSphere are trademarks or registered trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml.

Adobe, the Adobe logo, are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other countries.

Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.

Microsoft, Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

© Copyright International Business Machines Corporation 2016. This document may not be reproduced in whole or in part without the prior written permission of IBM.

US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

© Copyright IBM Corp. 2007, 2016 Course materials may not be reproduced in whole or in part without the prior written permission of IBM.


Contents

Preface ... P-1
Contents ... P-3
Course overview ... P-12
Document conventions ... P-14
Additional training resources ... P-15
IBM product help ... P-16

Unit 1
Information analysis overview ... 1-1
Unit objectives ... 1-3
Information analysis context and problem description ... 1-4
System source data assessment ... 1-5
Data assessment process ... 1-6
What does data profiling provide? ... 1-8
What does data analysis add? ... 1-9
Subject matter experts' role ... 1-10
IBM InfoSphere Suite used in data assessment ... 1-11
Information server data quality assessment tools ... 1-12
Information Analyzer features ... 1-13
QualityStage features ... 1-14
What tool to use when ... 1-15
Data assessment path: Functional view ... 1-16
Make data profiling a process ... 1-17
Checkpoint ... 1-18
Checkpoint solutions ... 1-19
Demonstration 1: Read case study ... 1-20
Demonstration 2: Read project scenario ... 1-23
Demonstration 3: Review Chemco data ... 1-28
Unit summary ... 1-31

Unit 2
Information Server overview ... 2-1
Unit objectives ... 2-3
Information Server components ... 2-4
Architecture ... 2-6
Information Server a platform more than a product ... 2-7
Client-Server architecture ... 2-8
Client icons ... 2-9
Using the Information Server thin client ... 2-10
Server management: users and groups ... 2-11


Checkpoint ... 2-12
Checkpoint solutions ... 2-13
Demonstration 1: Information Server setup ... 2-14
Unit summary ... 2-19

Unit 3
Information Analyzer overview ... 3-1
Unit objectives ... 3-3
InfoSphere Information Analyzer ... 3-4
Profiling and analysis functionality ... 3-5
Reporting ... 3-6
Security ... 3-7
Shared metadata ... 3-8
Analysis execution architecture ... 3-9
Information Analyzer: Login ... 3-10
Information Analyzer: Home page ... 3-11
Pillar menus ... 3-12
Online documentation ... 3-13
User interface features ... 3-14
Manage information displayed ... 3-15
Display details graphically ... 3-16
Set preferences ... 3-17
Checkpoint ... 3-18
Checkpoint solutions ... 3-19
Demonstration 1: Information Analyzer tour ... 3-20
Unit summary ... 3-24

Unit 4
Information Analyzer setup ... 4-1
Unit objectives ... 4-3
Resource configuration and metadata import ... 4-4
Configuring resources: Where is the data? ... 4-5
Configuring resources: Connecting the data ... 4-6
Metadata asset management ... 4-7
Setting up Data Connection & Import metadata in IMAM ... 4-8
Metadata Asset Manager ... 4-9
Metadata import: Discovering metadata ... 4-10
Importing metadata assets ... 4-11
Creating a new import area ... 4-12
Import parameters ... 4-13
Data connection ... 4-14
New data connection ... 4-15


New data connection identity ... 4-16
Select type of import ... 4-17
View results in the staging area ... 4-18
Flat file definition wizard ... 4-19
Flat file definition wizard ... 4-20
Flat file definition wizard prerequisite tasks ... 4-21
Flat file definition wizard ... 4-22
Creating and configuring projects ... 4-23
Projects ... 4-24
Creating a project ... 4-25
Complete project properties: 7 categories ... 4-26
Project data source administration ... 4-27
Register interest in data to be analyzed ... 4-28
Add users/groups to a project and define role ... 4-29
Adding users/groups to a project ... 4-30
Analysis configuration ... 4-31
Project analysis settings ... 4-32
Checkpoint ... 4-33
Checkpoint solutions ... 4-34
Demonstration 1: Configuring Information Analyzer ... 4-35
Unit summary ... 4-54

Unit 5
Data Classes ... 5-1
Unit objectives ... 5-3
Goal is to document the data ... 5-4
Business metadata ... 5-5
Information Analyzer new features ... 5-6
Information Governance Catalog: Data Classes ... 5-7
Information Governance Catalog: Data Classes installed ... 5-8
Examples of the Three Types of Data Classes ... 5-9
Demonstration 1: IGC data classes ... 5-11
Information Analyzer: Data Classification ... 5-15
Information Analyzer data classes ... 5-16
Column Analysis - Details - Data Class ... 5-17
Information Governance Catalog - data classes ... 5-18
Information Governance Catalog - disabling a class ... 5-20
Information Governance Catalog - deselecting a class ... 5-21
Data Classification Summary ... 5-22
Information Analyzer thin client ... 5-23
Information Analyzer thin client ... 5-24
Information Analyzer thin client terminology ... 5-27
Information Analyzer thin client - Advanced Search ... 5-28


Data Quality Score ... 5-29
Data Quality Score Example ... 5-30
Checkpoint ... 5-31
Checkpoint solutions ... 5-32
Demonstration 2: Familiarization with IA thin client ... 5-33
Unit summary ... 5-40

Unit 6
Column Analysis ... 6-1
Unit objectives ... 6-3
Understand the business problem ... 6-4
Column Analysis overview ... 6-5
What does Column Analysis do? ... 6-6
Why is this important? ... 6-7
Structural integrity ... 6-8
Domain integrity ... 6-9
Domain integrity: Do you know what the field contains? ... 6-10
Domain integrity: What to look for? ... 6-11
Analysis process ... 6-12
Column Analysis: Step by step ... 6-13
Column Analysis: Run Column Analysis ... 6-14
Demonstration 1: Column Analysis ... 6-15
Column Analysis review: How to open ... 6-30
Column Analysis using data class as guidepost ... 6-31
Column Analysis: Data Classification ... 6-32
Column Analysis - New data classification ... 6-33
Column Analysis: Properties ... 6-34
Column Analysis: Domain and Completeness ... 6-35
Column domain using reference table ... 6-36
Column Analysis: Reference tables ... 6-37
Column Analysis: Reference table types ... 6-38
Demonstration 2: Create reference tables ... 6-39
Creating virtual columns ... 6-41
Identify virtual column components ... 6-43
Analyze virtual column ... 6-44
Demonstration 3: Create virtual column ... 6-45
Column Analysis: Format ... 6-48
Column Analysis: Notes ... 6-49
Demonstration 4: Create note ... 6-50
Checkpoint ... 6-52
Checkpoint solutions ... 6-53
Unit summary ... 6-54


Unit 7
Data profiling techniques ... 7-1
Unit objectives ... 7-3
Where to start? ... 7-4
Data Profiling - New performance options ... 7-5
Metadata integrity: Do you know what the field is? ... 7-6
Metadata integrity: What to look for? ... 7-7
Metadata integrity: What to add? ... 7-8
Domain analysis by data class: What to look for? ... 7-9
Assess validity by data class ... 7-10
Data classification summary ... 7-12
Assess identifiers ... 7-13
Review identifier properties ... 7-14
Review identifier domain values and formats ... 7-15
Verify indicators ... 7-16
Review indicator properties ... 7-17
Nulls and blanks in indicators ... 7-18
Skewing of indicator values ... 7-19
Find and document indicator issues ... 7-20
Validate codes ... 7-21
Review code properties ... 7-23
Nulls and blanks in codes ... 7-24
Skewing of code values ... 7-25
Find and document code issues ... 7-26
Assess quantities ... 7-27
Review quantity properties ... 7-29
Nulls, spaces and zeroes in quantities ... 7-31
Skewing of quantity values ... 7-32
Find and document quantity issues ... 7-33
Analyze dates ... 7-34
Review date properties ... 7-35
Nulls, spaces and zeroes in dates ... 7-36
Skewing of date values ... 7-37
Find and document date issues ... 7-38
Review text fields ... 7-39
Additional text field considerations ... 7-40
Summary ... 7-41
Checkpoint ... 7-42
Checkpoint solutions ... 7-43
Demonstration 1: Data classification ... 7-44
Unit summary ... 7-61


Unit 8
Table analysis ... 8-1
Unit objectives ... 8-3
Keys ... 8-4
Primary key determination ... 8-5
Primary Key: Walkthrough ... 8-7
Primary Key analysis: Single column key details ... 8-8
Single column key duplicates ... 8-9
Multi column key analysis ... 8-10
Multi column Primary Key ... 8-11
Data sampling ... 8-12
Sampling methods ... 8-13
Data sample properties ... 8-14
Run analysis ... 8-15
View results of multi-column key analysis ... 8-16
Duplicate check result ... 8-17
Duplicate check ... 8-18
Basic data profiling techniques in practice ... 8-19
Determine structural integrity ... 8-20
Structural integrity: Is the structure usable? ... 8-21
Checkpoint ... 8-22
Checkpoint solutions ... 8-23
Demonstration 1: Primary key analysis ... 8-24
Unit summary ... 8-31

Unit 9
Cross Table Analysis ... 9-1
Unit objectives ... 9-3
What is cross table analysis? ... 9-4
Foreign Key analysis ... 9-5
Referential integrity ... 9-6
FK analysis: Initial steps ... 9-7
FK analysis: Select pair table ... 9-8
FK analysis: Review results ... 9-9
FK analysis: Review domain overlap exceptions ... 9-10
Referential integrity: Can related data be linked? ... 9-11
Demonstration 1: Foreign key analysis ... 9-12
Cross domain analysis ... 9-16
View analysis details for cross domain ... 9-17
View frequency values ... 9-18
Cross-Table integrity review ... 9-19
Cross-Table data redundancy ... 9-20

© Copyright IBM Corp. 2007, 2016 Course materials may not be reproduced in whole or in part without the prior written permission of IBM.


Cross-Table data references
Checkpoint
Checkpoint solutions
Demonstration 2: Cross domain analysis
Unit summary

Unit 10 Baseline analysis

Unit objectives
Baseline analysis: Understanding the business problem
Overview
Starting baseline analysis
Setting the baseline
View the baseline analysis
View the baseline analysis summary
View the baseline analysis differences
Checkpoint
Checkpoint solutions
Demonstration 1: Baseline analysis
Unit summary

Unit 11 Reporting and publishing results

Unit objectives
Communicating the analysis results
Reporting
Reporting: Selecting report types
Reporting: Report model
Reports
Reporting: Creating new reports
Reports: Running reports
Reporting: Viewing reports
Reporting: View reports by date
Demonstration 1: Reporting
Publish analysis results
View published results from DataStage
Create DataStage table definition
View published information: Table level
View published information: Column level
Exporting DDL
Export a reference table


Checkpoint
Checkpoint solutions
Demonstration 2: Publishing results
Demonstration 3: Export reference tables
Unit summary

Unit 12 Data rules and metrics

Unit objectives
Overview: Data Rules, Rule Sets, Metrics - Information Analyzer
What is a data rule?
Some guiding concepts
Components
Organized by category
Category view
Data rule definition: Abstract rules
Logical rules
Executable rules
Predefined rules
IBM supplied predefined rules
Benchmarks
Rule versus rule set
Rules and rule set execution results
User-Named rule output tables - overview
User-Named rule output tables - defining
User-Named rule output tables - simple
User-Named rule output tables - advanced
User-Named output tables - auto-registration
Define the IADB as a source
Set IADB as a project data source
Select option on rule bindings
Purging output tables - manual method
Purging output tables - global solution
Purging output tables - automatic method
Purging output tables - per rule
Metrics
Metrics guiding concepts
Summary of Information Analyzer quality controls
Checkpoint
Checkpoint solutions


Data Quality Demonstrations
Demonstration 1: Data Rules using logical variables
Demonstration 2: Data Rules using functions
Demonstration 3: Test a data rule definition
Demonstration 4: Manage Output Tables
Demonstration 5: Bundle related rules into a rule set
Demonstration 6: Organize with folders
Demonstration 7: Metrics
Demonstration 8: View summary statistics on My Home
Unit summary


Course overview

Preface overview
This course introduces the concepts and methods used to perform information analysis. IBM InfoSphere Information Analyzer, Information Governance Catalog, and QualityStage will be used to perform data profiling, data assessment, and metadata enrichment tasks. Students will learn how to use the IBM InfoSphere suite to analyze data and report results to business users. Information discovered during analysis will be used to construct data rules. This course will also explore techniques for delivering data analysis results to ETL developers and show how to develop more meaningful metadata to reflect data discovery results. An information analysis methodology and a case study will be used to guide hands-on labs.

Intended audience
This is a basic course for business data analysts who want to profile and assess data using Information Analyzer, and for data quality analysts who need to measure data quality.

Topics covered
Topics covered in this course include:
• Information Analysis concepts
• Information Server overview
• Information Analyzer overview
• Information Analyzer Setup
• Column analysis
  - Concepts
  - Basic data profiling techniques in practice
• Data profiling techniques
• Primary key analysis
  - Concepts
  - Basic data profiling techniques in practice
• Foreign key and cross domain analysis
  - Concepts
  - Basic data profiling techniques in practice
• Baseline analysis
• Reporting and publishing
• Data Rules and Metrics

Course prerequisites
There are no prerequisites for this course.


Document conventions
Conventions used in this guide follow Microsoft Windows application standards, where applicable. As well, the following conventions are observed:
• Bold: Bold style is used in demonstration and exercise step-by-step solutions to indicate a user interface element that is actively selected or text that must be typed by the participant.
• Italic: Used to reference book titles.
• CAPITALIZATION: All file names, table names, column names, and folder names appear in this guide exactly as they appear in the application. To keep capitalization consistent with this guide, type text exactly as shown.


Additional training resources
• Visit IBM Analytics Product Training and Certification on the IBM website for details on:
  - Instructor-led training in a classroom or online
  - Self-paced training that fits your needs and schedule
  - Comprehensive curricula and training paths that help you identify the courses that are right for you
  - IBM Analytics Certification program
  - Other resources that will enhance your success with IBM Analytics Software
• For the URL relevant to your training requirements outlined above, bookmark:
  - Information Management portfolio: http://www-01.ibm.com/software/data/education/


IBM product help

Help type: Task-oriented
When to use: You are working in the product and you need specific task-oriented help.
Location: IBM Product - Help link

Help type: Books for Printing (.pdf)
When to use: You want to use search engines to find information. You can then print out selected pages, a section, or the whole book. Use Step-by-Step online books (.pdf) if you want to know how to complete a task but prefer to read about it in a book. The Step-by-Step online books contain the same information as the online help, but the method of presentation is different.
Location: Start/Programs/IBM Product/Documentation

Help type: IBM on the Web
When to use: You want to access any of the following:
• IBM - Training and Certification: http://www-01.ibm.com/software/analytics/training-and-certification/
• Online support: http://www-947.ibm.com/support/entry/portal/Overview/Software
• IBM Web site: http://www.ibm.com


Information analysis overview


Information Analyzer v11.5 © Copyright IBM Corporation 2016 Course materials may not be reproduced in whole or in part without the written permission of IBM.

Unit 1 Information analysis overview

Unit objectives

• Describe the major functions of:
  - Data profiling
  - Data analysis
• List the tools used in profiling and analysis


Information Analysis context and problem description

• Large number of complex sources in enterprises
• New systems require integration of existing data
• Important data questions to address:
  - What does the data mean?
  - Can we use it as a source for a new system?
  - Can we integrate data from different sources?


In this environment, no single system is the universally agreed-upon system of record for specific elements of information. Instead, to get a complete view, you have to look across many systems, and because the relationships between data in different systems are not always understood, this is not always possible. In addition, redundancy of data disrupts the ability to get a complete view. Frequently the problem is not data entry but data integration and reconciliation. Fixing the data after implementation is too late and too expensive.


Source system data assessment

• Measures the suitability of a data source for a designated purpose within the context of a project

[Figure labels: Target System; Candidate Data Sources; Project; Data Warehouse, ERP, CRM; Business Requirements; Data “re-purposed”]


Data sources will likely contain data in a format that was suitable for the original business purpose. However, when that same data is examined for suitability in a new project, data that was of good quality for its original purpose can turn out to be of poor quality for the new one. The Data Quality Assessment process examines the candidate source data from the perspective of suitability for the target system, given the constraints of the project business requirements.


Data assessment processes

• Data profiling:
  - Collect statistics about a data source
  - Assess metadata - is the metadata accurate?
• Data analysis:
  - Look at and summarize data
  - Draw conclusions


Data profiling definition from Wikipedia, the free encyclopedia:
“Data profiling is the process of examining the data available in an existing data source (e.g. a database or a file) and collecting statistics and information about that data. The purpose of these statistics may be to:
• Find out whether existing data can easily be used for other purposes.
• Give metrics on data quality including whether the data conforms to company standards.
• Assess the risk involved in integrating data for new applications, including the challenges of joins.
• Track data quality.
• Assess whether metadata accurately describes the actual values in the source database.
• Understanding data challenges early in any data intensive project, so that late project surprises are avoided. Finding data problems late in the project can incur time delays and project cost overruns.
• Have an enterprise view of all data, for uses such as Master Data Management where key data is needed, or Data governance for improving data quality.”


What does data profiling provide?

Source System Analysis
- Provides the key understanding of the source data:
  Column Domain analysis
  Table/Primary Key analysis
  Foreign Key analysis
  Cross-domain analysis

Iterative Analysis
- Leverages the analysis to facilitate iterative tests
  Baseline analysis

[Figure labels: Column Analysis; Primary Key Analysis; Foreign Key & cross domain Analysis (Source 1, Source 2); Baseline Analysis (Source 1)]


Data profiling provides the basis for information analysis. It asks the question: what does the data really look like? Data profiling is thus a structured process to discover the characteristics of the data. This process can be performed sequentially: column analysis, followed by table analysis, followed by foreign key analysis. Alternatively, it can take a less structured approach, with interactive cycles performed as needed. At any time following column analysis, baseline analysis can be used to establish a reference point. You can then go back and repeat portions of the data analysis and compare them to the baseline.
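
The idea of setting a baseline and comparing later results against it can be illustrated outside the product. The following is a minimal sketch in plain Python, not Information Analyzer functionality: the function names `profile` and `compare_to_baseline`, the `STATE` column, and the sample rows are all hypothetical, and the three measures shown are only a small subset of what a real column analysis collects.

```python
def profile(rows, column):
    """Summarize one column: row count, null count, distinct count."""
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v not in (None, "")]
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
    }


def compare_to_baseline(baseline, current):
    """Return {measure: (baseline_value, current_value)} for measures that changed."""
    return {
        name: (baseline[name], current[name])
        for name in baseline
        if baseline[name] != current[name]
    }


# Hypothetical data: an initial load, then a later load of the same column.
first_load = [{"STATE": "NY"}, {"STATE": "CA"}, {"STATE": ""}]
later_load = [{"STATE": "NY"}, {"STATE": "CA"}, {"STATE": "TX"}, {"STATE": None}]

baseline = profile(first_load, "STATE")           # the reference point
differences = compare_to_baseline(baseline, profile(later_load, "STATE"))
print(differences)
```

Only the measures that differ from the baseline are reported, which mirrors the purpose of a baseline: it makes change visible without re-reading the full profile.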


What does data analysis add?

Source System Analysis
- Enriches profiling content with:
  Metadata definitions and Terms
  Annotations for information or action
  Validation of structural properties
  Validation of domains and formats
  Validation of keys
  Identification of redundancies
- Delivers information through:
  Reports
  Shared Metadata
  Published Analytical Results


Data analysis takes the results of data profiling and adds metadata definitions, annotations, and data validations. Information derived from data analysis can then be communicated to the rest of the team via reports, metadata sharing, and published analytical results that push the findings into the environment where the ETL developers work.


Subject matter experts’ role

• Subject matter experts are a critical success factor (probably the most critical) of a data assessment.

• Without their involvement, a data assessment is futile. • Speak their language (establish understanding of business terms using Information Governance Catalog).

• Have supporting materials - data examples and reports.


Subject matter experts understand the data content and how it relates to business processes. Without their input, data analysis is futile. Including appropriate subject matter experts in a project is therefore a truly critical success factor.


IBM InfoSphere Suite used in data assessment

• Information Analyzer:
  - Discover Domain, Structure and Relationships (physical) of data
  - Add terms and definitions
  - Link objects to data fields and tables
  - Build and test data rules
• Exception Manager:
  - Build data rules and test data for compliance
• Information Governance Catalog:
  - Document the business users’ language:
    − Add terms
    − Create categories (implement hierarchies for terms)
  - Identify data stewards
• QualityStage:
  - Perform pattern investigation for free-form fields


Information Analyzer, QualityStage, and Information Governance Catalog are components of Information Server. Information Server provides several tools for performing data assessment: Information Analyzer, Exception Manager, Information Governance Catalog, and QualityStage. Each tool provides capabilities that are unique to that tool.


Information Server data quality assessment tools

• Focus on targeted areas of understanding
• Build a business case for further assessment, monitoring
• Value Frequency Distribution
• Validation summary report
• Pattern Distribution
• Failed rules by record report
• Data Type Analysis …
• Duplicates analysis

[Figure labels: DataStage, Pre-Process Data*; Discovered Rules; Analyze Data; Measure Quality; Information Analyzer, QualityStage; Known Rules]


Data quality assessment can use a variety of tools from the Information Server suite:
• DataStage
• Information Analyzer
• QualityStage
• Information Governance Catalog
The box labeled “pre-process data” represents the tasks needed to extract the data from the source database, possibly transform the data, and load it into the staging area. Although this step is not required, it is often completed to isolate the source data that will be used for information analysis.


Information Analyzer features

• Column Analysis
  - Histogram values
  - Identify invalid values
• Key identification
  - Primary
  - Foreign
• Identify data redundancy
• Notes attached to columns and tables
• Create data rules
• Produce reports


Information Analyzer can analyze any source system that it can connect to via ODBC. The minimum information it needs is table and column names.

Column Analysis (CA): Based on actual data values (not metadata), Column Analysis determines the true physical characteristics of the data, such as data type, precision, scale, and nullability. It also calculates the frequency distribution, identifies the distinct values, and can create a sample data file. Column Analysis replaces the manual, time-consuming, error-prone process of traditional data analysis.

Primary and Foreign Key Analyses:
• Works with a random sample of data.
• Identifies the Primary Key candidates.
• Identifies candidate Foreign Key relationships.

Cross-Table Analysis (XT): Cross-Table Analysis compares distinct values from a column against distinct values from columns in other tables. The goal of the analysis is to detect columns that share a common domain. It identifies potentially redundant data and potential referential integrity issues, and can uncover unknown data rules.
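
The three kinds of analysis described above can be approximated in a few lines of code to show what each one measures. This is a minimal sketch in plain Python under stated assumptions, not how Information Analyzer is implemented: the function names, the `customers` and `orders` tables, and their columns are hypothetical, and the type inference covers only integer, decimal, and string.

```python
from collections import Counter


def infer_type(values):
    """Infer a type from the actual values rather than from declared metadata."""
    def all_parse(cast):
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if not values:
        return "unknown"
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "decimal"
    return "string"


def column_analysis(rows, column):
    """Column analysis: inferred type, nullability, distinct values, frequencies."""
    raw = [row[column] for row in rows]
    present = [v for v in raw if v not in (None, "")]
    return {
        "inferred_type": infer_type(present),
        "nullable": len(present) < len(raw),
        "distinct": len(set(present)),
        "frequency": Counter(present),
        # Single-column key candidate: no nulls and every value unique.
        "key_candidate": len(present) == len(raw) and len(set(present)) == len(raw),
    }


def domain_overlap(rows_a, col_a, rows_b, col_b):
    """Cross-table check: fraction of distinct col_a values also found in col_b."""
    a = {r[col_a] for r in rows_a if r[col_a] not in (None, "")}
    b = {r[col_b] for r in rows_b if r[col_b] not in (None, "")}
    return len(a & b) / len(a) if a else 0.0


# Hypothetical sample tables, with values held as text as they would be in a file.
customers = [
    {"CUST_ID": "1", "STATE": "NY"},
    {"CUST_ID": "2", "STATE": "CA"},
    {"CUST_ID": "3", "STATE": ""},
]
orders = [
    {"ORDER_ID": "10", "CUST_ID": "1"},
    {"ORDER_ID": "11", "CUST_ID": "1"},
    {"ORDER_ID": "12", "CUST_ID": "4"},
]

id_profile = column_analysis(customers, "CUST_ID")
state_profile = column_analysis(customers, "STATE")
overlap = domain_overlap(orders, "CUST_ID", customers, "CUST_ID")
print(id_profile["inferred_type"], id_profile["key_candidate"], overlap)
```

In this sample, the order that refers to customer 4 has no matching customer row, so only half of the distinct foreign key values overlap the primary key domain. That is the kind of potential referential integrity issue that cross-table analysis is meant to surface.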


QualityStage features

• Provides specialized data quality processing:
  - Clean and standardize data
  - Remove/reconcile duplicates
• Provides visual tools for designing quality rules and matching logic
  - Integrated with DataStage


InfoSphere QualityStage is the ‘Cleanse’ functionality of IBM Information Server. The quality functions include:
• Free-form text investigation - allowing you to recognize and parse out individual fields of data from free-form text.
• Standardization - allowing individual fields to be made uniform according to your own standards.
• Address verification and correction - which uses postal information to standardize, validate, and enrich address data.
• Matching - which allows duplicates to be removed from individual sources, and common records across sources to be identified and linked.
• Survivorship - which allows the best data from across different systems to be merged into a consolidated record.
The true power of QualityStage is in its ability to match data from different records, even when it appears very different. Because of this ability to match records, QualityStage is a key enabler of creating a single view of customers or products.
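
To make the relationship between standardization and matching concrete, the following sketch standardizes two free-form address strings and then compares them. It is an illustration only, written in plain Python: the abbreviation table, the 0.85 threshold, and the function names are assumptions, and the simple string similarity from the standard library's `difflib` stands in for the far more capable matching that QualityStage performs.

```python
import re
from difflib import SequenceMatcher

# Illustrative abbreviation table; a real standardization rule set is far larger.
ABBREVIATIONS = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD"}


def standardize(text):
    """Uppercase, strip punctuation, collapse whitespace, apply abbreviations."""
    tokens = re.sub(r"[^A-Za-z0-9 ]", " ", text).upper().split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def is_match(a, b, threshold=0.85):
    """Treat two values as duplicates when their standardized forms are similar enough."""
    ratio = SequenceMatcher(None, standardize(a), standardize(b)).ratio()
    return ratio >= threshold


print(standardize("12 Main Street."))
print(is_match("12 Main Street.", "12  main st"))
print(is_match("12 Main Street", "98 Oak Avenue"))
```

Standardizing before comparing is the point of the example: the two spellings of the same address differ as raw text but become identical once case, punctuation, spacing, and abbreviations are made uniform.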


What tool to use when

• Data Quality Assessment generally starts with Information Analyzer:
  - Discovers data condition quickly
  - Indicates potential domain integrity issues
  - Identifies potential structural/relational integrity issues
  - Tests adherence of data to data rules
• Further analysis then performed with QualityStage:
  - Analyzes free-form data
  - Provides pattern investigation on domain integrity issues
  - Helps determine standardization and matching requirements for duplicate data issues
• Information Governance Catalog:
  - Documents the business vocabulary
  - Terms can be linked to physical data objects


QualityStage is a product targeted at the data cleansing (standardization) requirements of free-form fields such as name, address, and descriptions, and at resolving (matching) duplicate record issues involving these free-form fields. Not every data quality assessment (DQA) effort will require the use of QualityStage.
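
Pattern investigation of a free-form field can be illustrated with a small sketch. The convention used here, in which each letter becomes `A` and each digit becomes `9`, is a common profiling shorthand chosen for this example; it is an assumption and not necessarily the pattern notation that QualityStage or Information Analyzer reports. The function names and the sample phone values are hypothetical.

```python
from collections import Counter


def pattern_of(value):
    """Map letters to 'A' and digits to '9'; keep every other character as-is."""
    return "".join(
        "A" if ch.isalpha() else "9" if ch.isdigit() else ch
        for ch in value
    )


def pattern_distribution(values):
    """Count how many values share each pattern."""
    return Counter(pattern_of(value) for value in values)


# Hypothetical free-form phone number field.
phones = ["555-1234", "555-9876", "5551234", "n/a"]
distribution = pattern_distribution(phones)

for pattern, count in distribution.most_common():
    print(pattern, count)
```

A dominant pattern with a few outliers suggests which values need standardization (the number without a hyphen) and which are not valid for the domain at all (the placeholder text).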


Data assessment path: Functional view

[Figure: functional view of the data assessment path. Labels recovered from the diagram:]
• IT Data Steward; SME
• Staged Source(s)
• Information Analyzer: Full Volume Profiling; Targeted Columns
• QualityStage: Targeted Columns; All Targeted Information Entities
• Domain Integrity: Lexical Analysis; Pattern Consistency
• Entity Integrity: Duplicate Analysis; Targeted Data Accuracy
• Metadata Integrity
• Domain Integrity: Completeness; Consistency; Create Reference Tables
• Structural Integrity: Key Analysis
• Relational Integrity: Cross-Table Analysis
• Report Review; Data Alignment Decisions
• Data rule identification and validation
• Data exception remediation


The exact data assessment path used will vary from project to project. It will also vary somewhat depending on the particular findings at any point in time. For instance, column analysis (as performed by Information Analyzer) might reveal data conditions that need to be further explored using QualityStage or Information Analyzer. In this module we will investigate the role of QualityStage.


Make data profiling a process

• Establish a profiling and assessment process:
  − Identify:
    − Profiling requirements – set goals
    − Candidate sources
    − Security needs
    − Additional requirements (accessibility, sensitivity, and availability)
  − Build execution plan that identifies:
    − Who
    − What
    − When
  − Identify any needs for further exploration
  − Leverage the metadata repository
  − Update the data periodically


Make data profiling a process

Identify profiling project requirements. Projects that try to do too much in one pass generally fail. Remember your overall goals and identify what is relevant. What data sources will be included? Who can assess the data sources (for example, make annotations)? Document the potential sources. Profile what you expect to use; weed out and annotate the extraneous.


Checkpoint

• Data assessment consists of what two processes?
• Which two InfoSphere Information Server tools can be used to measure data quality?
• Which InfoSphere Information Server component can be used to capture the business user’s language?


Checkpoint

Answer the checkpoint questions to quickly check your mastery of the presentation material in this unit.


Checkpoint solutions

1. Data assessment consists of what two processes? Data profiling and data analysis

2. Which two InfoSphere Information Server tools can be used to measure data quality? Information Analyzer and QualityStage

3. Which InfoSphere Information Server component can be used to capture the business user’s language? Information Governance Catalog


Checkpoint solutions

Answers to the checkpoint questions are provided here.


Demonstration 1: Read case study

Purpose:
Introduction to the ChemCo data warehouse case study. Describe the business requirements for the ChemCo Data Warehouse course project.

Task 1. Read case study.

Executive Summary
ChemCo Corporation is a leader in the wholesale chemical supply marketplace, providing its customers with a wide range of chemical intermediate manufacturing products, such as hexchloride, propanol, and ammonia. ChemCo Corporation made the strategic decision to build a decision support system consisting of a central data warehouse which will in turn feed several analysis databases. A comprehensive understanding of the data that will source this data warehouse is critical to estimate needed data cleansing and ETL programming efforts.

Company Stats:
Name of Business: ChemCo Corporation
Type: Chemical supply
Organizational structure: 12 regional warehouses with corporate headquarters in Denver, Colorado.

The Business Challenge
ChemCo wants to build a global, unified view of its product and customer data. To select a trusted system of record, ChemCo must first investigate data quality issues.

Source Systems and Issues
ChemCo Corporation has identified multiple data sources as feeds to the data warehouse. The potential source systems vary in data quality and use different methods for identifying customers. These issues are a serious concern of the management, who would like to see a comprehensive plan for addressing these problems. The challenge is to identify rules for cleansing the data to provide consolidated views of the data across all sources.

Existing systems are:
• Customer Sales
• Inventory
• Finance

Data requirements:
• Customer name information is spread across free-form text fields. Business users would like to see this organized into specific fields.
• Remove all duplicate customer records.
• Establish a unique customer profile.
• Some fields contain all-blank entries. Blanks and nulls (no value whatever) should be treated as invalid entries (not true of the current systems).
• Sales information must be accurate and conform to documented business rules, especially all computed data fields.

You have been assigned to the project in the role of Data Analyst and are charged with the task of performing a Data Quality Assessment on the Sales data.

Results:
You have been introduced to the ChemCo data warehouse case study. You have read the business requirements for the ChemCo Data Warehouse course project.
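Two of the data requirements above (blanks and nulls treated as invalid, and computed fields that must agree with their inputs) can be illustrated with a small Python sketch. This is only an illustration under stated assumptions: the column names `custname`, `qty`, `unit_price`, and `ext_price`, the rule "extended price equals quantity times unit price", and the sample rows are all hypothetical and are not taken from the ChemCo files.

```python
def is_invalid(value):
    """Treat None (null) and empty or whitespace-only strings as invalid."""
    return value is None or str(value).strip() == ""


def invalid_count(rows, column):
    """Count rows whose value in `column` is null or blank."""
    return sum(1 for row in rows if is_invalid(row.get(column)))


def computed_field_errors(rows, tolerance=0.005):
    """Return rows whose stored ext_price differs from qty * unit_price.

    The rule itself is an assumed business rule used for illustration.
    """
    bad = []
    for row in rows:
        expected = row["qty"] * row["unit_price"]
        if abs(expected - row["ext_price"]) > tolerance:
            bad.append(row)
    return bad


# Hypothetical sample rows.
sales = [
    {"custname": "Acme Chemical", "qty": 2, "unit_price": 10.0, "ext_price": 20.0},
    {"custname": "   ", "qty": 3, "unit_price": 5.0, "ext_price": 15.0},
    {"custname": None, "qty": 1, "unit_price": 7.5, "ext_price": 9.0},
]

print("invalid customer names:", invalid_count(sales, "custname"))
print("rows violating the computed-field rule:", len(computed_field_errors(sales)))
```

In the course, checks of this kind are carried out inside Information Analyzer rather than in hand-written code; the sketch only shows what each requirement means for an individual record.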


Demonstration 2: Read project scenario

Purpose:
Understand and describe the business and project requirements for the ChemCo Data Warehouse project.

Task 1. Read the ChemCo project approach.

1. A project team has been assembled to perform a Data Quality Assessment of the ChemCo data. This demonstration describes the makeup of the project team. Review the ChemCo Data Warehouse project plan and staff assignments. This is a reading demonstration to explain the project configuration to support data analysis for the business case. This is meant to simulate a real project configuration and how it is staffed.
2. The following ChemCo project definition establishes business requirements and identifies candidate source data.

ChemCo management has decided to use a project methodology comprising several phases:
1. Analysis
2. Design
3. Construction
4. Testing
5. Implementation

During the analysis phase the project manager wants to have project roles assigned, user IDs created and given access to software, potential source data identified and assessed, and a data warehouse data model created. You have two roles:
• InfoSphere Software Administrator (for this demonstration only)
• Data Analyst (for all remaining demonstrations)

Your project role is that of a Data Analyst. You have been asked to participate in source system assessment, test data design, and end user acceptance testing; consequently, you will participate in all project phases. Your first task is to understand the project business requirements and then perform a data assessment on the potential source data. The problems you discover should be documented and reported to the full project team, since your results will be used to assess data cleansing requirements. Using DataStage, the source data has been extracted and stored in sequential flat files.

Results:
You have read the business and project requirements for the ChemCo Data Warehouse project.


Project information

1. Project business requirements:
   a. Clean customer data – no duplicate records
   b. Clean sales data that can easily link to customer records with all computed data correct
   c. Required data warehouse metrics
      I. Total profit by customer
      II. Profit margin by customer
      III. Sales data must be accurate and conform to documented business rules, especially all computed values
2. Project approach:
   a. Security: Data analysis will be restricted to particular data stewards and data analysts.
   b. Data staging: Data will be extracted from the online DB2 database and stored as a set of sequential files. Therefore, data analysis will be performed on a frozen copy of the live data. This may cause some issues because the live online database could undergo structural modification and undergo record updates from online users; these data changes will not be immediately available to the data assessment team. This problem will be addressed after the first wave of information analysis is completed.
   c. Data Analyst roles:
      I. James Harris – userid jharris
      II. Bob LeClair – userid bleclair
   d. Data Stewards:
      I. Bill Betz – userid bbetz
      II. Doug Smith – userid dsmith
   e. Subject Matter Experts:
      I. Diane Weir
      II. Karen Everett
      III. Pete Scobby
   f. Data profiling review checkpoints:
      I. Column analysis – domain sets for critical data elements established
      II. Record identifiers selected for each major table
      III. Identification of reference tables needed for transformation functions
3. Critical data elements
   1. Tables and columns
      1. CUSTOMER
         1. CUSTID
         2. CUSTNAME
         3. CREDCODE
      2. CARRIER
         1. CARRIERID
      3. VENDOR
         1. VENDNO
      4. ORD_HDR
      5. ORD_DTL
      6. ITM_MSTR
      7. UNITCTLG
   2. Reference tables that should be created
      a. Credit rating
      b. Item master
4. Data problems that need identification
   a. Data duplication - how is this identified?
   b. Customer keys that are not unique
   c. Blank or null data columns
   d. Incorrect connections between customers and sales information
   e. Any other data quality issues that will interfere with correct identification of customers and products
   f. Any data quality issues that will prevent correct calculation of project metrics – total customer sales and total product sales
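Two of the data problems listed above, customer keys that are not unique and incorrect connections between customers and sales information, correspond to a key-uniqueness check and an orphan-record check. The Python sketch below illustrates both under stated assumptions: `CUSTID` comes from the critical data elements list, but the `ORDERNO` column, the presence of `CUSTID` on order rows, and all sample values are hypothetical.

```python
from collections import Counter


def duplicate_keys(rows, key):
    """Return key values that occur in more than one row, sorted."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


def orphan_rows(child_rows, child_key, parent_rows, parent_key):
    """Return child rows whose key value has no match in the parent rows."""
    parent_values = {row[parent_key] for row in parent_rows}
    return [row for row in child_rows if row[child_key] not in parent_values]


# Hypothetical sample rows.
customers = [{"CUSTID": "C001"}, {"CUSTID": "C002"}, {"CUSTID": "C002"}]
orders = [
    {"ORDERNO": 1, "CUSTID": "C001"},
    {"ORDERNO": 2, "CUSTID": "C009"},  # no such customer
]

print("non-unique customer keys:", duplicate_keys(customers, "CUSTID"))
print("orders without a customer:", orphan_rows(orders, "CUSTID", customers, "CUSTID"))
```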


Demonstration 3: Review ChemCo data

Purpose:
Become familiar with ChemCo source data.

Task 1. Locate ChemCo sequential files on virtual machine.

1. The data files to be analyzed in this course are contained in the C:\CourseData\KM803Files\Chemco\Seq folder on your VM Windows machine. Open this folder and verify you have 15 files present: 11 have a .txt extension, 3 have an .rpt extension, and one has an .INI extension.
2. Using Notepad, open the CUSTOMER.txt file. Note that the first record is not true data; rather, it contains the column names for the CUSTOMER.txt file. The QETXT.INI file compensates for this by using the FLN=1 parameter setting, which directs the ODBC driver to skip the first record when presenting source data to Information Analyzer.
3. Open the QETXT.INI file. QETXT.INI is an ODBC configuration file. It describes the files within the sequential database directory. For example, if you use a text editor to open the file you can find the entry for the CUSTOMER.txt file described earlier. Note the file name, first data line number switch, delimiter, and column definitions. A portion of QETXT.INI is shown below:

[Figure: portion of QETXT.INI; not recoverable from the extracted text]

Results:
You have become familiar with ChemCo source data.
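The effect of treating the first record as column names rather than data can be sketched in Python with the standard `csv` module. This is only an analogy to what the demonstration describes for the ODBC driver, not a reproduction of QETXT.INI: the comma delimiter, the column list, and the sample record are assumptions made for illustration.

```python
import csv
import io


def read_with_header(stream, delimiter=","):
    """Read delimited text, using the first record as column names, not data."""
    reader = csv.reader(stream, delimiter=delimiter)
    header = next(reader)  # first record holds the column names
    records = [dict(zip(header, fields)) for fields in reader]
    return header, records


# Hypothetical stand-in for a file such as CUSTOMER.txt.
sample = io.StringIO("CUSTID,CUSTNAME,CREDCODE\nC001,Acme Chemical,A\n")
header, records = read_with_header(sample)

print(header)
print(records)
```

Without this handling, the column-name record would be profiled as if it were a customer, which would distort frequency and data type results.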


Unit summary

• Describe the major functions of:
  − Data profiling
  − Data analysis
• List the tools used in profiling and analysis


Unit summary

You should now be able to perform the functions listed on this slide.


Information Server overview


Information Analyzer v11.5 © Copyright IBM Corporation 2016 Course materials may not be reproduced in whole or in part without the written permission of IBM.

Unit 2 Information Server overview

Unit objectives

• Describe Information Server architecture
• Log onto Information Server Administration
• Add users


Unit objectives

Upon completing this unit you should be able to describe Information Server architecture, log onto Information Server administration, and add users.


Information Server components

IBM Information Server
• Discover, model, and govern information structure and content
• Standardize, merge, and correct information
• Combine and restructure information for new uses
• Synchronize, virtualize and move information for in-line delivery

Platform Services


Information Server components

IBM Information Server allows you to:
• Understand all sources of information within the business and analyze its usage, quality, and relationships.
• Cleanse the data to assure its quality and consistency.
• Transform the data to provide enriched and tailored information.
• Deliver the data to make it accessible to people, processes, and applications.

Information Server products that correspond to these functions are:
• Understand: Information Analyzer
• Cleanse: QualityStage
• Transform: DataStage
• Deliver: Information Services Director

All of these functions are based on a parallel processing infrastructure that provides leverage and automation across the platform. Information Server also provides connectivity to nearly any data or content source, and the ability to deliver information through a variety of mechanisms. Underlying these functions is a unified metadata management foundation that provides sharing of knowledge throughout a project lifecycle, along with a detailed understanding of what information means, where it came from, and how it is related to information in other systems.


Architecture

• Application Server running a set of suite-specific and product-specific services
• Repository:
  − All metadata in a shared (referred to as COMMON) model
  − Supports extension models for usage by individual products
  − Collection of runtime events/metadata
• Engine:
  − All processing is carried out by the Parallel Engine
• User-specific user interfaces:
  − User interface depends on function performed (administrative, profiling, transformation)
• Common Services:
  − Reporting
  − Scheduling
  − Security
  − Logging


Architecture

Information Server architecture comprises several components:

Repository
• Database containing Information Server suite objects such as DataStage jobs.

Engine
• DataStage parallel processing engine.

User interfaces
• DataStage Administrator.
• Designer client.
• Director client.
• Information Server console.
• And others.


Information Server a platform more than a product

• Consists of multiple modules that share a common foundation of shared platform services
• Can purchase one or more of the modules:
  − Each module inherently includes the platform services
• The modules work together:
  − As new modules are added, they integrate into the shared platform services
  − This provides a flexible architecture


Information Server a platform more than a product

Information Server is a platform on which specific product units can be added. Information Analyzer is one of the optional units. Just which units are present is determined at installation time; units can also be added post-installation. The units are all integrated in such a fashion that they work together.


Client-Server architecture

One Server, Multiple Clients, One Thin Client

[Figure: client machines connected to the server]


Client-Server architecture

Client machines contain the user interfaces such as DataStage Designer, Director, and Administrator.

The server does most of the work:
• Compiles and runs programs, generates output.
• Manages the repository.

For computer B, components can be on separate servers, but for now they must be homogeneous environments, i.e. the same platform.


Client icons

Thin Client – Web Browser
Use for:
• Administration
• Information Governance Catalog

Clients - Microsoft® Windows XP/2003
Use for:
• DataStage
• QualityStage
• Information Analyzer

[Figure: client icons for Console, Designer, Director, and Administrator]

Note: FastTrack, Metadata Workbench, Information Manager not shown

Client icons

This slide shows the client icons and the software they invoke.

One thin (HTML) client for administration, used to perform the following functions:
• Add users and groups.
• Configure domain authentication.

Fat clients, which need to be connected to the server:
• Console
• Administrator
• Director
• Designer


Using the Information Server thin client

• Provides access to server administration activities:
  − Adding and modifying users and groups
  − Granting user access to Information Server modules
  − Modifying reporting preferences
• Provides user interface to Information Governance Catalog


Using the Information Server thin client

A thin client Web interface is used to manage the Information Server. Users can be added and linked to platform components, roles assigned, and engine credentials set. The Web interface also forms the user interface into Information Governance Catalog.


Server management: users and groups


A thin client Web interface is used to manage the Information Server users and groups. Users can be added and linked to platform components, roles assigned, and engine credentials set. Click the Administration tab, then open the Users and Groups branch; users will appear in the main pane. Selecting a function from the Task list will give you the option to modify objects or even add new ones.


Checkpoint

1. Which Information Server product performs ETL?
2. Which platform service increases processing speed?
3. Which platform component holds metadata?


Checkpoint

Answer the checkpoint questions to quickly check your mastery of the presentation material in this unit.


Checkpoint solutions

1. Which Information Server product performs ETL?
   • DataStage
2. Which platform service increases processing speed?
   • Parallel processing
3. Which platform component holds metadata?
   • Repository


Checkpoint solutions

Answers to the checkpoint questions are provided here.


Demonstration 1: Information Server setup

Purpose:
Use administrative functions within Information Server to add users and change reporting defaults. Modify report preferences. Describe the steps needed to log onto Information Server administration and view user IDs and their roles.

Before a user can log onto Information Analyzer, the Information Server administrator needs to set up a user ID and link it to appropriate roles. This is the top level of the Information Server security architecture. This demonstration shows the background security infrastructure that controls user access to Information Server products.

From project business requirements the following were assigned the role of Data Analyst:
• James Harris - userid jharris
• Bob LeClair - userid bleclair
• Joyce Weir - userid jweir

Task 1. Information Server logon.

1. Log onto Information Server: double-click the IIS Server LaunchPad icon on the Windows desktop. If the page does not open, you may need to restart the operating system (log on as student/student if prompted).
2. Click the Administration Console icon.
3. Enter your username and password. Demonstrations in this course use student as both the user ID and the password.


Task 2. View users.

1. Click the Administration tab.
2. Expand the drop-down window labeled Users and Groups.
3. Click the option labeled Groups.
4. Verify the group IT is present. If the group is not present, on the right side, click the New Group link. Then add a group with a Principal ID and Name of IT. In the Roles section, under Suite and Suite Component, select the Roles check box to select all the roles. Then, in the bottom right corner, click the Save and Close button.
5. In the left pane, click Users.
6. Verify that the following users are present:
   • jharris
   • bleclair
   • jweir
   If these three users are not present, on the right side, click the New User link to add them. Then specify these credentials for each of the new users:
   • jharris: User Name, Password, and Confirm Password is "jharris", First Name is "James", and Last Name is "Harris"
   • bleclair: User Name, Password, and Confirm Password is "bleclair", First Name is "Bob", and Last Name is "LeClair"
   • jweir: User Name, Password, and Confirm Password is "jweir", First Name is "Joyce", and Last Name is "Weir"
   After you have added each user, in the bottom right corner, click the Save and Close button.

The role assignments give each person access to functions within the IS product suite but are not specific to any particular project. You will do more with assigning roles for these users in a later demonstration when you create projects.
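The verification in this task amounts to comparing a list of required user IDs against the users already defined. The Python sketch below shows that comparison only; it does not call any Information Server API, and the `existing_users` list is a hypothetical stand-in for what the Users pane displays.

```python
def missing_users(required, existing):
    """Return required user IDs that are not yet defined, in the required order."""
    existing_ids = set(existing)
    return [user for user in required if user not in existing_ids]


required_analysts = ["jharris", "bleclair", "jweir"]
existing_users = ["student", "jharris"]  # hypothetical current state

to_create = missing_users(required_analysts, existing_users)
print("users still to be added:", to_create)
```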


Task 3. Modify reporting.

Reports are used to communicate your data analysis findings to the entire project team. You will normally use the reporting functions found in the Information Server client, not the Administration console. However, some reporting controls are found only in the Administration console, so the next steps demonstrate how to find and modify some report settings.

1. Click the Reporting tab.
2. Click the Preferences option.
3. Change the default expiration to expire after 2 days.
4. Click the Save button located in the lower right portion of your window.
5. Click the Log Out button located in the upper right portion of your window.

Results:
You logged onto the server and viewed the users and groups defined to the system. You changed the reporting preferences.


Unit summary

• Describe Information Server architecture
• Log onto Information Server Administration
• Add users


Unit summary

You should now be able to perform the functions listed on this slide.


Information Analyzer overview


Information Analyzer v11.5 © Copyright IBM Corporation 2016 Course materials may not be reproduced in whole or in part without the written permission of IBM.

Unit 3 Information Analyzer overview

Unit objectives

• Describe the major functions of Information Analyzer
• Explain the concept of data profiling


Unit objectives

After completing this unit the student should be able to both describe the major functions of Information Analyzer and explain the concept of data profiling.


InfoSphere Information Analyzer

• What does it do?
• Analyzes data sources to discover structure, contents and quality of information
  − Infers the reality of the data, not just the data definition
  − Finds and reports missing, inaccurate and inconsistent data
  − Allows review of the quality of data throughout the life cycle
• Who uses it?
• Business and Data Analysts, Data Quality Specialists, Data Architects and Data Stewards


InfoSphere Information Analyzer

Information Analyzer infers what a data structure should be by analyzing column content. This means that Information Analyzer should ideally read every record for a particular column to discover such things as minimum and maximum lengths. By examining the contents, Information Analyzer reports what is, not what we think it should be based on the metadata. This process is controlled by Business and Data Analysts, Data Quality Specialists, Data Architects, and Data Stewards.
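The idea of inferring structure from content rather than from the declared definition can be sketched in a few lines of Python. This is a simplified illustration, not Information Analyzer's inference logic: the three-way type classification, the result field names, and the sample values are assumptions.

```python
def infer_column(values):
    """Infer a simple type and length range from the values actually present."""
    present = [v for v in values if v is not None and v.strip() != ""]

    def all_parse(parser):
        """True if every present value is accepted by the given parser."""
        try:
            for v in present:
                parser(v)
        except ValueError:
            return False
        return True

    if not present:
        inferred = "unknown"
    elif all_parse(int):
        inferred = "integer"
    elif all_parse(float):
        inferred = "decimal"
    else:
        inferred = "string"

    lengths = [len(v) for v in present]
    return {
        "inferred_type": inferred,
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
        "null_or_blank": len(values) - len(present),
    }


# Hypothetical column declared as text but holding only digits.
custid_values = ["1001", "1002", "99", "", "100345"]
profile = infer_column(custid_values)
print(profile)
```

Because every value is read, a single non-numeric entry would change the inferred type to string, which is why full-volume analysis is preferred over sampling for this kind of discovery.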


Profiling and analysis functionality

• Column, table, cross-table
• Summary and detail levels
• Drill down
• Frequencies
• Completeness and validity
• Current-to-prior comparisons
• Key analysis/violations
• Reference table generation


Profiling and analysis functionality

Profiling is composed of each of the functions on the slide; it is performed automatically by the Information Analyzer engine. Once the profiling process has completed, the Data Analyst reviews the results and makes adjustments. Analytical information is displayed on Information Analyzer screens and presented to the Data Analyst for review. The analyst can either accept the Information Analyzer results or change them. However, the source data is never changed.
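Two of the listed functions, frequencies and completeness, reduce to simple counting over a column. The following Python sketch illustrates that counting only; it is not how the Information Analyzer engine is implemented, and the credit code values are hypothetical.

```python
from collections import Counter


def frequency_distribution(values):
    """Return (value, count, share) tuples, most frequent value first."""
    counts = Counter(values)
    total = len(values)
    return [(value, n, n / total) for value, n in counts.most_common()]


def completeness(values):
    """Share of values that are neither null nor blank."""
    if not values:
        return 0.0
    filled = sum(1 for v in values if v is not None and str(v).strip() != "")
    return filled / len(values)


# Hypothetical column values.
credit_codes = ["A", "B", "A", "", "A", None, "Z"]

for value, count, share in frequency_distribution(credit_codes):
    print(repr(value), count, round(share, 3))
print("completeness:", round(completeness(credit_codes), 3))
```

The analyst's review step then consists of deciding which of the observed values are valid, for example whether `Z` is a legitimate code; the computation itself never alters the source column.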


Reporting

• Nearly 40 out-of-the-box reports
• Customizable for:
  − Logos
  − Report Names
  − Relevant Parameters
• Ability to include analytical notes
• Delivered in User Interface or via Web browser


Reporting

Information Analyzer reporting is a service that is supplied by the Information Server platform. Numerous report templates can be used to provide reports for the data quality assessment team. These reports can be customized with regards to Logos, report names, and relevant parameters. The analyst also has given the ability to add notes and all of these functions are provided within the Information Analyzer GUI.

Security

• Multi-level security and administration framework, at the Suite, Product, Project, and Data source levels
• User, role, and privilege assignment

Security is another service provided by the Information Server platform. The security service can either operate standalone (known as the internal registry) or interface with the server's operating system or LDAP. Users are added to an Information Analyzer project on the Users tab, and roles can be assigned to each user.

Shared metadata

• Common connectivity
• Metadata discovery shared across the Suite
• Projects register interest only in data sources of concern
• Metadata import focused on user interest
• Analytical results published in a secured framework

Metadata can be shared across all components of the Information Server suite, including Information Analyzer, DataStage, QualityStage, FastTrack, and the Information Governance Catalog. Consequently, data profiling results can be made visible to the ETL development team. An Information Analyzer project registers interest in a data source whose metadata has already been imported. Each project gets its own set of internal tables that store the results of its analyses. Analytical results can be published so that they are available to other components in the Information Server framework; this is how Information Analyzer analyses are made available to DataStage ETL developers.

Analysis execution architecture

• Builds DataStage jobs (referred to in IA as scripts)
• Uses efficient techniques to perform analysis for:
  - Column Analysis: performs a single scan of a table for N columns. There is no memory constraint.
  - Primary Key: separating Single Column Primary Key analysis from Multiple Column Primary Key analysis eliminates unnecessary processing when a single-column PK is found.
  - Cross Domain (including identification of foreign keys): column compatibility comparisons include Data Classification equality as a requirement, which results in a smaller set of candidate columns being analyzed and thus faster execution.
  - Referential Integrity Analysis: uses Frequency Distribution results already stored in the repository to perform the analysis.

Information Analyzer builds DataStage jobs to perform data profiling functions. The data analyst does not need to understand DataStage programming; the jobs are built behind the scenes and submitted to the parallel engine for execution. Because Information Server knows how to build DataStage jobs that perform well, Information Analyzer analysis jobs run efficiently. Primary key analysis is one example: Information Analyzer reuses the frequency distribution tables previously built by column analysis.
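The value of reusing frequency distributions can be shown with a small Python sketch. Given only the stored value counts for each column, it checks whether a column is a single-column key candidate and which child values have no match in a parent column. The uniqueness rule, the function names, and the sample data are assumptions made for illustration; the source does not describe the internal algorithm.

```python
from collections import Counter


def is_single_column_key(freq):
    """Assumed rule: a key candidate has no nulls and no value occurring twice."""
    return bool(freq) and None not in freq and all(n == 1 for n in freq.values())


def orphan_values(child_freq, parent_freq):
    """Child-column values that never occur in the parent column."""
    return sorted(v for v in child_freq if v is not None and v not in parent_freq)


# Frequency distributions as they might be stored after column analysis.
customer_id_freq = Counter([101, 102, 103, 104])
order_customer_freq = Counter([101, 101, 103, 999])

print(is_single_column_key(customer_id_freq))
print(is_single_column_key(order_customer_freq))
print(orphan_values(order_customer_freq, customer_id_freq))
```

Neither check touches the original rows; both work entirely from the counts, which is why no second scan of the source table is needed in this sketch.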

Information Analyzer: Login

• Prerequisites:
  - A valid Information Server user ID
  - Information Server profiles created with proper roles assigned
• Steps:
  - From the Windows desktop, locate the Console for IBM Information Server icon and double-click it
  - OR from the Windows Start program menu, locate the IBM Information Server Console program and click it
  - Enter User Name
  - Enter Password
  - Enter Server, where Server is a predefined host name for the computer that hosts IBM Information Server
  - Click Login OR press key

To log on to Information Analyzer you must first have a valid user ID and security profile. The user ID is stored internally in Information Server and roles are assigned to it, for example the Information Analyzer user role. The server name must be the exact computer name; supplying only the TCP/IP address is not sufficient. The slide lists the steps required to log on.
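The requirement that the Server field contain a computer name rather than a TCP/IP address can be checked before a login is attempted. The following Python sketch uses the standard `ipaddress` module; the function names, messages, and the sample host name `is-server` are hypothetical and are not part of the Information Server Console.

```python
import ipaddress


def looks_like_ip_address(server):
    """Return True if the text is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(server)
        return True
    except ValueError:
        return False


def validate_server_field(server):
    """Reject blank values and raw IP addresses; accept anything else as a host name."""
    server = server.strip()
    if not server:
        return "Server name is required"
    if looks_like_ip_address(server):
        return "Use the computer name, not the TCP/IP address"
    return "ok"


print(validate_server_field("192.168.1.10"))
print(validate_server_field("is-server"))
```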

Information Analyzer: Home page

• Home page:
  - Starting point for all work
  - Review Getting Started
  - Open Projects
  - Navigate to Open Workspaces

This graphic shows the Information Analyzer home page. The home page can be configured using Edit > Preferences, so you can modify it as you become more familiar with Information Analyzer features.

Pillar menus

• Methodology-driven navigation
• Each pillar represents a different portion of the lifecycle:
  - Overview (Project configuration)
  - Investigate
  - Develop
  - Operate
• Manage multiple workspaces
• Dockable or floatable tabs containing useful information or tools

(Graphic label: The Pillar Menu)

The Information Analyzer GUI contains two types of menus: File and Pillar. The File menu system is the familiar type seen in most Windows applications. The pillar menus, as depicted on this slide, open Information Analyzer functions such as column analysis and primary key analysis. Multiple analyses can be open at the same time, and the user moves from one function to another by using multiple workspaces.

Online documentation

Getting Started Guide
• Suggests what to do next:
  - Provides overview of tasks
  - Documents logical sequences of events
• Contains links to related items and tasks

Help
• By product
• Searchable

Documentation is found in several places:
• Reference manuals: PDF documents that can be downloaded from the IBM Information Center.
• Help text: the Information Analyzer GUI provides extensive information through its help facility.
• Getting Started Guide, which appears on the home screen: contains information to help the new Information Analyzer user, with convenient hyperlinks throughout.

User interface features

• Do multiple things at once with Workspaces and Task Panes
• Non-blocking dialogs
• Leave a problem and come back to it later (using History palette)

(Graphic label: Multiple Workspaces)

Task panes list functionality appropriate to the selected objects. Workspaces can be saved as tabs, allowing the user to move from one active function to another with minimal effort. The History tab is of particular value: it lets you return to a screen that you closed but now want to revisit. History is valid only for the current session, however. Dialogs are non-blocking, which means that you can have several dialogs open at the same time without interference.

Manage information displayed

• Automatic collapsible panes
• Hide unneeded information
• Reduce clutter – No stacked windows
• Drilldown progressively discloses what is important
• Green and red icon eye catchers

(Graphic labels: Column analysis anomaly; Primary Key candidate)

Column windows can be opened or closed, allowing the user to reduce screen clutter and view data only as needed; the show/hide bars control this. Colored icons are used throughout Information Analyzer: red usually indicates an anomaly, and green indicates a candidate for something, such as a primary key.

Display details graphically

• Graphical enablement and display of key analytical data

Information Analyzer displays analysis results in both grid and graphical formats. The user can frequently switch from one view to the other simply by clicking the relevant buttons (shown in the lower left portion of the graphic).

Set preferences

• Set configuration options
• Set user preferences
  - Edit > Preferences
• Size or move panels to your needs

You can modify configuration options to change the appearance of the startup screen. You can also modify user preferences to suit your way of working. For example, you can specify preferences for startup, change the behavior of panes, and customize the status bar.

To modify user preferences:
1. Select Edit > Preferences.
2. In the User Preferences window, select the type of preferences that you want to modify.
3. Modify the available options.
4. Click OK to close the window and save your changes.

You can also resize or move panels.

Checkpoint

1. True or False? The Information Analyzer GUI is methodology driven.
2. True or False? The Pillar menus provide access to the underlying Information Analyzer function.
3. True or False? The Information Analyzer palettes allow you to define business analyst users.

Answer the checkpoint questions to test your mastery of the material presented.

Checkpoint solutions

1. True or False? The Information Analyzer GUI is methodology driven.
   Answer: True
2. True or False? The Pillar menus provide access to the underlying Information Analyzer function.
   Answer: True
3. True or False? The Information Analyzer palettes allow you to define business analyst users.
   Answer: False

Answers to the checkpoint questions are provided here.

Demonstration 1: Information Analyzer tour

• Explore navigation and help

Purpose: Take a guided tour through the Information Analyzer GUI. Navigate through Information Analyzer and locate the primary functions. The GUI for Information Analyzer contains standard file menus and also a custom Pillar menu.

Task 1. Log on to Information Analyzer.
1. Launch IBM InfoSphere Information Server Console from the Desktop.
2. The user ID and password used in this course are student/student.
   Note: If you get a red flag next to the Server text box, then either you entered the wrong name for the server or Information Server is not running.

Task 2. Explore the user interface.
The five pillar menus are located in the upper left portion of your screen.
1. Click each pillar menu. Some menus have options that are grayed out. Most of these grayed out options can only be performed in the context of an open project.
   I. Home pillar menu: This is used for product configuration. Note that all options are available, yet no project has been selected.
   II. Overview pillar menu: Project level properties and dashboard are here. Valid for project context only.
   III. Investigate pillar menu: This is used to start each investigation type. Valid for project context only.
   IV. Develop pillar menu: Data Quality functions can be started here. Note: If you do not see the Data Quality entry, then your user ID needs to have the Rules role assigned in the Information Server Administration Console.
   V. Operate pillar menu: Log and scheduling views that help with troubleshooting analysis jobs. Project context is not necessary; these functions can also be performed from the Information Server Web Console.
   In addition to the pillar menus, Information Analyzer has file menus.
2. Click the Edit menu and then click Preferences.
3. Click the Web conferencing compatibility checkbox to select it. This option controls the appearance of the Information Analyzer user interface during Internet presentations.
4. Select Show Analysis tab on Dashboard in the Information Analysis folder (if it is not already selected). Enabling this option influences your starting page when opening a project.
5. Click the Status Bar option under Select View and then uncheck the Show activity animation in status bar checkbox. This removes a progress bar that normally appears during job execution.
6. Click the OK button to close the Preferences menu.
7. Click the View menu and select the Palettes option. Note the presence of four objects that should be checked.
8. If the palettes are unchecked, click each palette in turn until each one has a checkmark. The History palette lets you go back to previous workspaces within the context of a user session. Note the Palette tabs now visible in the left portion of the window (under the HOME menu). These tabs are handy when switching from one workspace to another.
9. Click the File menu. Note that you can create and delete projects.
10. Click the Help menu and then the Help option to view documentation.
11. Click the InfoSphere Information Analyzer link. If this link is not visible, click the Search link in the top right corner, type Information Analyzer in the search box, press Enter, and then click the IBM InfoSphere Information Analyzer link in the search results. Information Analyzer documentation is divided further into various topics of interest. More documentation sources will be explored in a later demonstration.
12. Close IBM InfoSphere Information Server Console and all open windows.

Results: You navigated through Information Analyzer and located the primary functions. The GUI for Information Analyzer contained standard file menus and also a custom Pillar menu.

Unit summary

• Describe the major functions of Information Analyzer
• Explain the concept of data profiling

Having completed this unit, you should now be able to perform the functions listed on the slide.

Information Analyzer setup

Information Analyzer v11.5 © Copyright IBM Corporation 2016 Course materials may not be reproduced in whole or in part without the written permission of IBM.

Unit 4 Information Analyzer setup

Unit objectives

• Connect Information Analyzer to a data source
• Import metadata
• Create projects
• Configure projects

This slide lists the objectives to be accomplished in this course unit.

Resource configuration and metadata import

• Functionality is independent of Information Analyzer
  - Creation of Resources and Import of Metadata is based on suite-wide common services and the common repository
• Does not require a project context
  - Any Resource or Metadata imported is reusable by any other component in the Suite, such as DataStage

This functionality is independent of Information Analyzer in the sense that hosts and data stores can be used by any of the suite components. Although you establish these objects for Information Analyzer, they can also be used in, for example, DataStage. As you will see later, many Information Analyzer functions must be performed within an open project. Resource configuration and metadata import are exceptions to this guideline.

Configuring resources: Where is the data?

• In order to perform Data Analysis, you must first identify where the data is located:
  - HOST to represent the host computer on which a particular database/file resides.
  - DATA STORE representing a database or files. A single HOST can have multiple DATA STORES.
  - DATA CONNECTION artifact that captures the user credentials (username/password) and type of Connector to access the DATA STORE.

To analyze data sources correctly, Information Analyzer must first be able to find the data. This means that the exact location, in terms of host, data store, and data connection, must be defined first.
• Host: a computer that hosts databases or files. It must be reachable on a network.
• Data Store: a collection of data, in the form of either a database or a collection of files contained in a directory. A Database contains database tables. A Data File is a collection of data organized into data structures of fields. Both of these assets are stored under Hosts and are consumed by assets that Information Server produces, such as DataStage jobs.
• Data Connection: must be defined in order to access the data store; it captures the credentials and the type of Connector. Examples of connector types are ODBC and native DB2.
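The relationship between the three resource types can be sketched as a small data model. The Python classes below are purely illustrative: the class names, field names, and sample values are assumptions and do not correspond to any Information Server API. The password value is a placeholder.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataConnection:
    name: str
    connector_type: str  # for example "ODBC" or "DB2"
    username: str
    password: str        # placeholder only; real credentials need secure storage


@dataclass
class DataStore:
    name: str
    kind: str            # "database" or "files"
    connection: DataConnection


@dataclass
class Host:
    name: str
    data_stores: List[DataStore] = field(default_factory=list)


# One host can carry several data stores, each reached through its own connection.
host = Host("dbhost01")
host.data_stores.append(
    DataStore("SALES", "database", DataConnection("sales_odbc", "ODBC", "analyst", "changeme"))
)
host.data_stores.append(
    DataStore("flatfiles", "files", DataConnection("files_conn", "ODBC", "analyst", "changeme"))
)

for store in host.data_stores:
    print(host.name, store.name, store.connection.connector_type)
```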

Configuring resources: Connecting the data

[Diagram: Host, Data Store/Connection, and Connector model; the connectors shown are ODBC, DB2, and Teradata]

This graphic represents much of the information discussed on the previous slide. In Information Analyzer, you define a HOST artifact to represent the host computer on which a particular database or file resides, and a DATA STORE artifact to represent the database or files. A single HOST may have multiple DATA STORES. In addition, you define a DATA CONNECTION artifact that captures the user credentials (username/password) and the type of Connector used to access the DATA STORE.

Metadata asset management

• Information Server metadata assets are stored in the XMETA Repository (also called the Metadata Repository or Shared Metadata Repository)
• Metadata assets include assets produced and consumed by Information Server products and components
  - Produced assets include: DataStage jobs, FastTrack mapping specifications, Information Governance Catalog terms, Information Server reports
  - Consumed assets include: table definitions, file descriptions, logical model entities and attributes, BI tool metadata
• The Repository stores different types of metadata
  - Business metadata: business terms, business rule descriptions, mapping specifications, stewards
  - Technical metadata: DataStage/QualityStage jobs and their components
  - Operational metadata

The Information Server Repository (XMETA) stores several different types of metadata, including business metadata, technical metadata, and operational metadata. Some of the metadata is produced by Information Server products; for example, DataStage jobs are produced by DataStage. Other metadata is consumed by Information Server products, such as the file descriptions of files read by DataStage jobs.

Setting up Data Connection & Import metadata in IMAM

• All the functionality related to Platform Subject Area and Metadata Import has been removed from Information Analyzer
• IA users are expected to use InfoSphere Metadata Asset Manager (IMAM) to define required data connections and to import metadata

As of Information Analyzer 11.5, the data connection and metadata import functionality has been removed from Information Analyzer and is now found in InfoSphere Metadata Asset Manager (IMAM). Users must use IMAM to define data connections and to import metadata.

Metadata Asset Manager

• Manage Repository metadata assets
• Import metadata assets into the Repository, to be shared with Information Server products
  - Metadata assets can be imported using engine Connectors and Bridges
    − Connectors are defined on the engine server system
    − Bridges are defined on engine client systems
  - "Metadata Interchange Servers" are used to exchange metadata assets between the engine client and server systems that have the bridges and connectors and the IS services system
    − Metadata Interchange Servers are installed and configured when the engine client and server software is installed
    − New Metadata Interchange Servers can be added
• Search and browse Repository metadata assets
  - Limited to external metadata assets
    − Can view all assets in Information Governance Catalog
• Manage potential duplicates and disconnected assets

InfoSphere Metadata Asset Manager (IMAM) is the primary Information Server product for managing external metadata assets, that is, assets consumed but not produced by Information Server products. As with the Information Governance Catalog, you can browse and search metadata assets in the Repository, but IMAM is limited to external metadata. IMAM also has import/export capabilities for external metadata assets. In this respect it complements the Information Governance Catalog, which does not have these capabilities.

Metadata import: Discovering metadata

• Architecture
  - All metadata in Information Analyzer is stored in the Information Server metadata repository
    − Metadata created or enriched by one tool can be used immediately by another
• Prerequisites:
  - User has defined necessary resources (HOST, DATA STORE and DATA CONNECTION)
  - Database Administrator has provided appropriate credentials to allow user to access the metadata
  - External configuration of ODBC DSN is in place

Before analyzing data, you must import the existing metadata. Information Analyzer stores this as the “defined” metadata for a column. During a column analysis review, Information Analyzer displays the defined metadata alongside the inferred metadata, which is the metadata that Information Analyzer would have built for the column based on the column contents.
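The comparison of defined and inferred metadata can be pictured as a property-by-property difference. The Python sketch below is an illustration only; the property names (`data_type`, `length`, `nullable`) and the sample values are hypothetical and are not taken from Information Analyzer.

```python
def compare_metadata(defined, inferred):
    """Return the properties where inferred metadata disagrees with defined metadata."""
    differences = {}
    for prop, defined_value in defined.items():
        inferred_value = inferred.get(prop)
        if inferred_value != defined_value:
            # Record both values so a reviewer can decide which one to accept.
            differences[prop] = (defined_value, inferred_value)
    return differences


# Defined: what the imported metadata claims. Inferred: what the contents suggest.
defined = {"data_type": "string", "length": 20, "nullable": False}
inferred = {"data_type": "integer", "length": 5, "nullable": False}

print(compare_metadata(defined, inferred))
```

A column declared as a 20-character string whose contents are all short integers would surface here as two differences for the analyst to review.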

Importing metadata assets

• Create an import area
• Select metadata interchange server
• Specify import parameters
  - Path to source of import
    − A file can exist on local system or metadata interchange server system
    − A database would have host, database, schema and table specified
  - Select the parameter to display documentation about it
• Imported metadata assets can be viewed first in a staging area before they are shared to the Repository
  - Called a Managed import
  - Express imports share without staging first
    − Depends on import settings

Creating a new import area

• Name of import area
• Select metadata interchange server
• Select bridge or connector

Metadata assets are first imported into a staging area. To create a new import staging area, click New Import Area on the Import tab. This displays the Create New Import Area window. Specify a name for the import area, and then select the metadata interchange server you are using to import the metadata. The metadata assets, and the bridges and connectors available to import them, vary depending on the metadata interchange server. For example, DB2 and the DB2 connectors may be installed on one server but not on another. Some engine client systems may have BI metadata available that is not available on other engine client systems. After you select the metadata interchange server, select the connector or bridge you will use to import the metadata assets. For example, select the IBM InfoSphere DB2 Connector to import the physical data model and data from a DB2 database. Click Next to move to the Import Parameters page. The values to be entered depend on the type of import. Select a parameter to display documentation about it.

Import parameters

• Select data connection
• Configure other parameters as needed

A number of parameters determine what will be imported. Check the boxes and fill in the values as required. Click the browse button on the Data connection box to see all available data connections.

Data connection

• Browse data connections
• Create new data connection if needed

If the required data connection does not appear in the drop-down box, click the New Data Connection button.

New data connection

• Name new data connection
• Choose database for data source
• Provide credentials

A new data connection needs a name, a data source, and credentials.

New data connection identity

• Host system name
• Choose database name for data source or leave blank

A new data connection needs identity parameters: the host system name and the name of the database that contains the data to import.

Select type of import

• Express import: Automatically share if import settings requirements are satisfied
• Managed import: Preview metadata assets in a staging area

On this page you choose the type of import to perform: an express import or a managed import.

An express import loads the metadata assets into the staging area and then automatically imports them into the Information Server Repository, if all import settings requirements have been satisfied. A managed import loads the assets into the staging area for you to preview before you decide to import them into the Repository.

In this example, a managed import has been selected. Click the Import button to import the data source. After the import has run successfully, you are notified that the import area was created and the data was staged.
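The difference between the two import types can be pictured as a small decision rule. The following Python sketch only illustrates the behavior described in the notes; the names ImportType and run_import are hypothetical and are not part of Metadata Asset Manager.

```python
from enum import Enum


class ImportType(Enum):
    EXPRESS = "express"
    MANAGED = "managed"


def run_import(import_type, requirements_satisfied):
    """Return where the imported assets end up once the import has run."""
    # Both import types first load the assets into the staging area.
    location = "staging area"
    # Only an express import shares the assets automatically, and only
    # when the import settings requirements are satisfied.
    if import_type is ImportType.EXPRESS and requirements_satisfied:
        location = "repository"
    return location


print(run_import(ImportType.MANAGED, True))
print(run_import(ImportType.EXPRESS, True))
```

With a managed import the assets always stay in the staging area until you share them yourself, which is why the demonstration later uses the Share to Repository button.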

View results in the staging area

• Click Analyze to analyze assets
• Click Share to Repository to import to Repository
  − Disabled if import settings requirements are not satisfied; for example, assets contain potential duplicates

After the metadata assets have been loaded into the staging area, you can perform an analysis of the assets and preview them. Click the Analyze button to initiate the analysis. The analysis generates a set of statistics about the assets, displayed in the lower left panel. In the right panel, you can browse through the assets that have been loaded into the staging area. Click the Share to Repository button to import the assets into the Information Server Repository. This button is not enabled until you perform the analysis and preview.

Flat file definition wizard

The flat file definition wizard will be covered in the following slides.

Flat file definition wizard

• Reasons for using:
  − Do not need to wait to fully define the file on the server (this function is normally performed by a technology support specialist)
  − Can build the QETXT.INI that is required for ODBC connectivity to sequential files

Information Analyzer users frequently use flat files (text files) for analysis. Flat files, like database tables, need to have their metadata defined somewhere. For databases, the metadata is usually held in a system catalog; for flat files, the QETXT.INI file is used for this purpose.
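To make "metadata for a flat file" concrete, the following Python sketch reads a comma-delimited file with quoted values and derives a column name, a guessed type, and a maximum length for each column. It is only an illustration of the kind of information that has to be defined; it does not reproduce the wizard's logic or the QETXT.INI layout, and the sample data and the function name infer_columns are hypothetical.

```python
import csv
import io


def infer_columns(text, delimiter=",", quotechar='"'):
    """Return (name, guessed_type, max_length) for each column of a delimited file."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter, quotechar=quotechar))
    header, data = rows[0], rows[1:]  # column names are taken from the first line
    columns = []
    for index, name in enumerate(header):
        values = [row[index] for row in data if index < len(row)]
        # Naive type guess: INTEGER only if every value is an optional minus sign plus digits.
        is_integer = bool(values) and all(v.lstrip("-").isdigit() for v in values)
        max_length = max((len(v) for v in values), default=0)
        columns.append((name, "INTEGER" if is_integer else "VARCHAR", max_length))
    return columns


# Hypothetical sample content; not the course's Items.txt file.
SAMPLE = 'ITEMCODE,DESCRIPTION,QTY\n"A100","Acetone, 5 L",12\n"B200","Benzene",7\n'
COLUMNS = infer_columns(SAMPLE)
for column in COLUMNS:
    print(column)
```

Note that the quoted value "Acetone, 5 L" contains a comma, which is why the file format (delimiter and quoting) has to be known before the columns can be defined.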

Flat file definition wizard prerequisite tasks

• Assumptions:
  − The location of the files is fixed. For example:
    − Server is on Linux OS
    − Directory \data\KM802Files\Chemco\Seq exists
    − File Items.txt in \data\KM801Files\Chemco\Seq
  − The format of the files is known. For example, comma delimited with quotes
• Create an ODBC connection on server:
  − Note: You do NOT need to provide detailed column definitions
• Create data store definition using ODBC in Information Analyzer

If you are using the flat file definition wizard, you need to perform several prerequisite tasks. The tasks listed on the slide put you in a position to use the wizard's graphical interface.

Flat file definition wizard

• Import Metadata
  − Home > Import Metadata, select data file
• Select Identify Flat File from Tasks list
• Wizard will lead through the steps to create the detailed metadata

To get started with the flat file definition wizard, navigate to it in the Information Analyzer interface. Once you have started the wizard, a series of screens steps you through the process of creating the QETXT.INI file.

Creating and configuring projects

Detailed analyses and the creation of data rules are performed under the umbrella of a project.

Projects

• Overview
  − Information Analyzer operates within the confines of an Analysis Project. This containment vehicle provides the user with a selected view of the repository and the activities performed against it.
• Key Components of a Project are:
  − Details
  − Data Source Administration
  − User Administration
  − Access Control
  − Analysis Options

Most analyses are performed within the context of a project. An Information Analyzer project defines the data to be examined and the users with the authority to analyze that data.

Creating a project

• First Steps:
  − Click New Project from the Project dropdown menu
  − Select Type = Information Analyzer
  − Name project

Only users with Information Analyzer Administrator authorization from Information Server can create, modify, or delete projects.

Complete project properties: 7 categories

• Project Details capture the identifying attributes of a project:
  − Type
  − Name
  − Description
  − Owner
  − Primary Contact

Create entries in the Data Sources and Users tabs to initiate your project settings. Further refinement of the analysis settings can be made in the tab labeled "Analysis Settings". These settings become the defaults for your project, but in some cases they can also be overridden at the column analysis level.

Project data source administration

• Source Registration is the process by which an Information Analyzer project denotes a specific interest in a Data Store or Data Collection/Table, or any sets of those objects.
  − Allows the core repository information about those objects to exist unchanged.
  − Allows the GUI to partition the information to be displayed to the user.
  − Mechanism by which IA creates corresponding Analysis Masters for each object in the repository and creates a relationship to those repository objects.

Data source registration connects your project to the data sources that were imported by the Data Administrator. This source registration process creates a new set of tables in the Information Analyzer database (IADB).

Register interest in data to be analyzed

• Register Interest:
  − From the Project Properties tab, click the Data Sources tab
  − User can browse and select only those sources relevant to the project
  − User can select an entire Table or a subset of Columns within a Table
  − Add or remove Tables and Columns from a project at any time

Registering interest in source data does not copy the source data or its defined metadata; rather, a record link is created to the source data.

Add users/groups to a project and define role

• Adding Users:
  − From the Project Properties tab, click the Users tab.
  − Click Browse to find available users.
  − Select the User you want to add and click Add and OK.
  − Groups are handled the same way.

Users can be collected into groups and treated as a single category, which simplifies administration.

Adding users/groups to a project

• Add Project Roles to Users
  − Select the roles appropriate for the new user, and click Save All.
• Project roles
  − Information Analyzer Business Analyst: Reviews analysis results. This role can set baselines and checkpoints for baseline analysis, publish analysis results, delete analysis results, and view the results of analysis jobs.
  − Information Analyzer Data Operator: Manages data analyses and logs. This role can run or schedule all analysis jobs.
  − Information Analyzer Data Steward: Provides read-only views of analysis results. This role can also view the results of all analysis jobs.
  − Drill-down user: View full data record

Users and groups can be assigned a project role, which influences what they can do in Information Analyzer.

Analysis configuration

• Engine:
  − Change default settings
  − Engine instance used to run scripts (DataStage jobs)
  − Change DataStage project
  − Change user ID
• Database:
  − IADB database
  − JDBC setting
• Analysis Settings

Analysis configuration ensures that Information Analyzer can properly communicate with the DataStage engine and persistent repository of the Information Server platform. Configuration settings under Analysis Settings provide threshold values that are used in the Information Analyzer flagging system. This flagging system uses red and green icons to catch the user's attention during data analysis review.

Project analysis settings

• Analysis options are used by the system to control analysis and its results:
  − The system is installed with default settings for these options.
  − The user can change these default settings for the system, a project, a data source, a data collection or a data field.
  − Changes in the analysis options can typically tighten or loosen the system’s capability to make its analytical inferences.

Analysis settings, originally set at the Information Analyzer product level, can be overridden at the project level. These parameters influence Information Analyzer analysis results. Most analysis settings can also be overridden at the individual analysis result level by the data analyst.
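The levels at which settings can be changed (system, project, data source, data collection, data field) can be pictured as a lookup in which the most specific level that defines an option wins. The following Python sketch is one way to model that idea; the option names and values are hypothetical and are not taken from the product.

```python
from collections import ChainMap

# Hypothetical option names and values, for illustration only.
SYSTEM_DEFAULTS = {"nullability_threshold": 1.0, "uniqueness_threshold": 99.0}
PROJECT_SETTINGS = {"uniqueness_threshold": 95.0}
DATA_SOURCE_SETTINGS = {}
DATA_COLLECTION_SETTINGS = {}
DATA_FIELD_SETTINGS = {"nullability_threshold": 0.5}

# Most specific level first: a lookup falls through to broader levels
# until it finds a level that defines the option.
EFFECTIVE = ChainMap(
    DATA_FIELD_SETTINGS,
    DATA_COLLECTION_SETTINGS,
    DATA_SOURCE_SETTINGS,
    PROJECT_SETTINGS,
    SYSTEM_DEFAULTS,
)

print(EFFECTIVE["nullability_threshold"])  # defined at the data field level
print(EFFECTIVE["uniqueness_threshold"])   # defined at the project level
```

An option that is not overridden anywhere simply resolves to the system default.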

Checkpoint

1. True or False? Source metadata must be imported before Information Analyzer can analyze data.
2. True or False? Threshold parameters can be set at a global level over all projects.
3. True or False? Information Analyzer can add OS level users.

Answer the checkpoint questions to test your mastery of the material presented.

Checkpoint solutions

1. True or False? Source metadata must be imported before Information Analyzer can analyze data. Answer: True
2. True or False? Threshold parameters can be set at a global level over all projects. Answer: True
3. True or False? Information Analyzer can add OS level users. Answer: False

Checkpoint questions and answers.

Demonstration 1: Configuring Information Analyzer

• Creating ODBC data source
• Set Information Analyzer configuration options to enable data profiling jobs
• Connecting Information Analyzer to the Source Data
• Importing metadata
• Creating projects

Demonstration 1: Configuring Information Analyzer

Purpose: This demonstration shows the configuration settings for Information Analyzer at the product level. You will create an ODBC data source, set Information Analyzer configuration options, add the data store, import Chemco defined metadata, create the project, add users, and register interest in source data.

Task 1. Create ODBC data source.

1. From the desktop, open the 32-bit ODBC manager by double-clicking the odbc admin 32 icon.
2. Click the System DSN tab.
3. Click the Add button.
4. In the Create New Data Source window, click the IBM TextFile driver.
5. Click Finish.
6. In the Data Source Name box, type Chemcoseq. Be sure to type Chemcoseq and not just Chemco.
7. In the Database Directory box, type the path to the sequential files.
8. Check the Column Names in First Line box.
9. Click Test Connect. The test should report a successful connection.
10. Click OK to close the Test Connect dialog, and then click OK again. You are returned to the System DSN window, where the new Chemcoseq data source is listed.
11. Click OK.

Task 2. Set Information Analyzer configuration options to enable data profiling jobs.

1. Double-click the IBM InfoSphere Information Server Console icon on the Windows desktop.
2. Log into Information Server using student/student.
3. Click the Home pillar menu, open the Configuration branch, and then click the Analysis Settings option.
4. Click the Analysis Database tab. This is the database that will contain the results of your data analysis.

   The analysis database, commonly referred to as the IADB, contains tables with column value histogram data. The IADB grows in size as more data is analyzed. Note that you can update most options present on this screen. However, it is a product requirement that this database be accessible through both ODBC and JDBC. The connections must be defined on the server, not the client. These ODBC and JDBC connections have already been created for you.

5. Click the Analysis Engine tab.

   The analysis engine is actually the DataStage parallel engine. The DataStage username and password, if used on this screen, must correspond to a username and password with proper DataStage credentials as defined in the Information Server Web console. Do not change any settings; static credentials will work for these demonstrations. The entry under DataStage Project is the name of the DataStage project where all of the Information Analyzer analysis jobs are executed; by default this is ANALYZERPROJECT. The Retain Scripts option determines whether job execution scripts are saved in the DataStage project directory once the job has completed. Since you normally want the script deleted if the job runs successfully, this option is normally set to No. This option can be overridden at the time the individual job is submitted for execution.

6. Click the Analysis Settings tab.

   These values are threshold settings that direct Information Analyzer on how to handle various situations in data analysis. These options can be overridden during data profiling review. You will encounter them in later demonstrations.

7. Minimize Information Server.
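Step 4 of this task mentions that the IADB holds column value histogram data. In essence, such a histogram is a frequency distribution: each distinct value in a column together with the number of times it occurs. The following Python sketch shows the idea on a handful of hypothetical values; it says nothing about how the IADB tables are actually laid out.

```python
from collections import Counter

# Hypothetical values of a single column, one entry per row.
COLUMN_VALUES = ["ASCO", "BRCO", "ASCO", "", "ASCO", "BRCO"]

# Distinct value -> number of occurrences.
HISTOGRAM = Counter(COLUMN_VALUES)

for value, count in HISTOGRAM.most_common():
    share = count / len(COLUMN_VALUES)
    print(repr(value), count, f"{share:.0%}")
```

Because one entry is kept per distinct value of every analyzed column, it is easy to see why the IADB grows as more data is analyzed.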

Task 3. Connecting Information Analyzer to the Source Data.

1. Double-click the Metadata Asset Manager icon on the Windows desktop.
2. Log into Metadata Asset Manager using student/student.
3. Click the Import tab.
4. Click the New Import Area button.
5. Type Chemcoseq into the Import area name box.
6. Move the scroll bar in the Select a Bridge or Connector box down to the ODBC connector and select it.
7. Click Next.
8. Beside the Data connection box, click the Select data connection button.
9. In the Select a Data Connection window, click the New Data Connection button.
10. Enter Chemcoseq as the name.
11. Choose Chemcoseq in the Data source drop-down box, enter student/student in the Username and Password boxes, select the Save 'Password' check box, and then click OK. The new connection is tested, and you are returned to the Create New Import Area window.
12. Click the Next button.
13. In the Create New Import Area window, click the Select existing asset button at the end of the Host system name box, and then choose IBMCLASS.
14. Click OK.
15. Click Next.
16. On the next window, type Chemcoseq into the Import Description box.
17. Ensure Managed Import is selected, and then click Import.

Task 4. Importing metadata.

After you have created the new import area in the previous task and clicked Import, a window shows that the import is being processed and then displays the completion messages.

1. Click OK. You are returned to the Staged Imports tab.
2. Click the Analyze button, and then expand the Host folder to display the data files.

   The statistics section shows the status of the assets in the import. You can check to make sure there are no Invalid Identities.

3. Click the Preview button.

   The new window also has a statistics section, but here certain cells have the value underlined.

4. Click one of these underlined cells to drill down into the details behind the cell's value.
5. Once you have reviewed the details, click Close to return to this window.
6. Once satisfied that the import was successful and there are no errors, click the Share to Repository button and click Yes to confirm the import. This imports the assets into the repository.
7. Close Metadata Asset Manager.

Task 5. Creating projects.

1. Maximize Information Server Console.
2. Several methods can be used to create a new project. Click the drop-down arrow to the right of the pillar icons.
3. Click the New Project option.
4. Enter Chemco into the Name box and choose Information Analyzer for the type.
5. Click OK. A project properties screen will appear. Note its tabs.
6. Take a moment to visit each of the other tabs and then return to the Details tab. Owner and Primary Contact information can be assigned, if desired, by clicking the associated icon. This will browse the Information Server user list.
7. Click the Enable drill down security checkbox.
8. Click the Data Sources tab. This is used to register interest in a data source that already exists in the repository. Recall that you imported the Chemcoseq metadata into the repository in an earlier task.
9. Click the Add button in the lower right-hand portion of the screen.

10. Successively click the arrow buttons to reveal the Seq data source tables.
11. To select all tables in the Seq source, click the Seq object and then click OK. You are returned to the project's Data Sources tab.
12. Verify that the tables from the Seq source are listed.
13. Click the Users tab, select student, and then select all the project roles.
14. Click the Browse button located in the lower portion of the screen.
15. Add user jharris to your project and assign the Data Operator, Business Analyst, and DrillDown User roles.
16. Click the Save All button located in the lower-right portion of the screen.
17. Click the Analysis Settings tab. Parameters shown on this screen are used throughout the profiling analysis but can also be restricted in your project. Note the Select View panel on the left-hand portion of the screen. It defaults to Project view.

18. Click the Data Sources view in the Select View panel. Default values for various thresholds are displayed. These values determine when Information Analyzer will suggest certain analysis decisions.
19. Select the Vendor table and then click the Modify button in the lower-right portion of the screen. You will now see the Analysis Settings, but note that you are placed on the Options view located in the upper-left portion of the window.
20. Click the 'Where clause' view and enter a condition for the VENDORCODE column: VENDORCODE = ASCO.

    Note: This can be accomplished by clicking the Add Condition button in the lower-right portion of the screen and double-clicking in the column cell and the value cell. By completing the Where clause for the VENDOR table, you limit the IA analyses to only the data qualified by that Where clause. This restriction applies only to the current project. By using the Where clause you can enforce security by value. Threshold parameters can be set at the database, table, column, or even column value (using the Where clause) levels.

21. Click OK and notice that a red flag now appears next to the VENDOR table. This means that the analysis settings for the VENDOR table differ from the analysis settings for the project.
22. Since you do not really want to restrict the records found in the VENDOR table, repeat the process used to create the condition but remove the condition instead. Make no further changes.
23. Close Information Server Console.

Results: This demonstration showed the configuration settings for Information Analyzer at the product level. You created the ODBC data source, set Information Analyzer configuration options, added the data store, imported Chemco defined metadata, created the project, added users, and registered interest in source data.
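The effect of the Where clause entered in step 20 can be pictured with an ordinary SQL filter. The following Python sketch uses an in-memory SQLite table as a stand-in for the VENDOR data; the VENDORNAME column and all row values are hypothetical, and the sketch does not show how Information Analyzer applies the condition internally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE VENDOR (VENDORCODE TEXT, VENDORNAME TEXT)")
conn.executemany(
    "INSERT INTO VENDOR VALUES (?, ?)",
    [("ASCO", "Vendor A"), ("BRCO", "Vendor B"), ("ASCO", "Vendor C")],  # hypothetical rows
)

# Without a condition, every row is eligible for analysis.
ALL_ROWS = conn.execute("SELECT COUNT(*) FROM VENDOR").fetchone()[0]

# With the condition, only the qualifying rows are analyzed.
FILTERED_ROWS = conn.execute(
    "SELECT COUNT(*) FROM VENDOR WHERE VENDORCODE = ?", ("ASCO",)
).fetchone()[0]

print(ALL_ROWS, FILTERED_ROWS)
conn.close()
```

Statistics computed over the filtered rows describe only the ASCO records, which is why the demonstration removes the condition again in step 22.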

Unit summary

• Connect Information Analyzer to a data source
• Import metadata
• Create projects
• Configure projects

Unit 5

Data Classes

Information Analyzer v11.5 © Copyright IBM Corporation 2016 Course materials may not be reproduced in whole or in part without the written permission of IBM.

Unit objectives

• Understand the relationship between categories and terms in Information Governance Catalog and Information Analyzer
• Link terms to data objects
• Create data definitions
• Understand IA thin client

Goal is to document the data

• There is a core set of information that all enterprises require
  − Standard names and definitions for data items:
    − Organized as hierarchies
    − With descriptions, examples, abbreviations, and stewardship information

Example: GL Account Number: The ten digit account number. Sometimes referred to as the account ID. This value is of the form L-FIIIIVVVV.

Business metadata

• Documents the business meaning of data
• In the language of the business, independent of technology
• Used to:
  − Define a shared meaning of data
  − Establish responsibility and accountability
  − Govern access
  − Share insights and experiences among users
• Should be managed by those that understand the meaning and importance of the data
• Helps to align the efforts of IT with the goals of the business

Information Analyzer new features

• Information Analyzer now includes a "new" entitlement to Information Governance Catalog (IGC)
• Information Governance Catalog is a “Supporting Program”
• In the event a customer does not already have Information Governance Catalog license entitlement, this "Supporting Program" permits them to utilize *only* the new Data Class features of Information Governance Catalog in coordination with Information Analyzer

IA 11.3.1.1 and earlier:

• Data Classes are private to IA; they are not shared with IGC, nor with any tool that can consume IGC data
• No way to create / modify / delete data classes
• Data Classification Analysis takes place in Domain tier (after CA), which means more load on WAS

IA current release:

• Data Classes are shared across all components of Information Server
• Ability for users to create / modify / delete data classes in IGC UI / IA CLI
• Data Classification Analysis takes place in Engine tier along with column analysis, which means the process scales with the number of nodes and reduces the load on WAS (the engine is typically the tier with the best computing resources)

This release brings a synergy between Information Analyzer (IA) and the Information Governance Catalog (IGC). Licensing has been modified so that IA includes a “new” entitlement to Information Governance Catalog (IGC). In the event a customer does not already have Information Governance Catalog license entitlement, this “Supporting Program” permits them to utilize only the new Data Class features of Information Governance Catalog in coordination with Information Analyzer. From now on, whenever IA is installed, IGC should also be installed.

Data Classes are now shared across all components of Information Server. Users can now create / modify / delete data classes in IGC UI / IA Thin Client.

One of the more important changes with this fixpack is that Data Classification analysis now takes place in the Engine tier along with Column analysis. Previously, it took place in the Domain tier after Column analysis. This means the process now scales with the number of nodes and reduces the load on WAS (the engine is typically the tier with the best computing resources).

© Copyright IBM Corp. 2007, 2016 Course materials may not be reproduced in whole or in part without the prior written permission of IBM.


Unit 5 Data Classes

Information Governance Catalog: Data Classes


Once IGC has been installed, together with a number of default data classes, the data classes can be viewed directly from the IGC menu as a hierarchy or from the browse list in IGC.


Information Governance Catalog: Data Classes installed

• IGC out-of-the-box Data Classes
• Create your own:
  − Data Classes
  − Hierarchies (You MUST keep the current Credit Card Number hierarchy)
• Robust API to export/import custom data classes


Here is the list of the out-of-the-box classes, which are installed at the same time as IGC. These cover all 3 types of Data Classes: Regex (SSN, ZIP, IP Address, Email etc.), list of values (Gender, CountryCode, USStateCode), and Java classes. While you can create your own hierarchies, please note that you MUST keep the ‘Credit Card Number’ hierarchy. This is because of the way the Java code was written. The IA Admin API was extended to include data class export and import capabilities using XML files.


Examples of the Three Types of Data Classes

• Three types of Data Classes
  − ‘Regex’ Regular Expressions
  − List of valid values
  − Custom Java Class


Here are examples of each of the three types of data classes.

The first type is the Regular Expression (‘regex’). In this example, a regular expression tests the format of a US SSN (with or without dashes). Note that this regular expression tests for fields that contain ONLY an SSN (that is the purpose of the ^ at the beginning and the $ at the end). In this case only string data types will be validated; any values in the column that are not strings will be rejected. Maximum and minimum data lengths are inferred from the Column Analysis. Any values outside of these lengths will be rejected.

The next type is a list of values, such as gender. Note that the length of the values does not need to be the same (Female vs F), and that Case Sensitivity is optional. This can be used for known lists that are not too large.

The last, and most versatile, type is to write your own Java class. This comes in handy when some sort of calculation is needed to classify the data. The example here is a credit card number that has a check digit, e.g. JapanCB, CreditCard etc.
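The three types can be sketched in a few lines of Python. This is an illustration only, not the product's implementation: the SSN pattern, the function names, and the use of the Luhn check-digit algorithm are assumptions made for the example (the course text says only that the custom class validates a check digit, and in the product that logic is written in Java).

```python
import re

# Assumed pattern: a US SSN with or without dashes. The ^ and $ anchors mean
# the whole field must be an SSN, not merely contain one.
SSN_PATTERN = re.compile(r"^\d{3}-?\d{2}-?\d{4}$")


def matches_ssn(value):
    """Regex type: only string values are validated; anything else is rejected."""
    return isinstance(value, str) and SSN_PATTERN.match(value) is not None


def matches_valid_values(value, valid_values, case_sensitive=False):
    """List-of-values type: values may differ in length; case sensitivity is optional."""
    if not isinstance(value, str):
        return False
    if case_sensitive:
        return value in valid_values
    return value.lower() in {v.lower() for v in valid_values}


def passes_luhn(value):
    """Custom-code type: a check-digit calculation (Luhn, used here as a stand-in)."""
    if not isinstance(value, str):
        return False
    cleaned = value.replace(" ", "").replace("-", "")
    if len(cleaned) < 2 or not (cleaned.isascii() and cleaned.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for position, char in enumerate(reversed(cleaned)):
        digit = int(char)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


GENDER_VALUES = ["Female", "F", "Male", "M"]  # hypothetical list of valid values
```

A value such as "SSN: 123-45-6789" fails the first check because of the anchors, which is the behaviour the notes describe.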


All Data Classes are managed by Information Governance Catalog (IGC). Any Data Class can be enabled or disabled in IGC, meaning it is enabled or disabled for all projects. Any Data Class can also be enabled or disabled within IA for a specific project only. The IA command line API can install Data Classes, so you are not forced to use the IGC UI. This is very helpful when creating a Data Class that has dozens or even thousands of values.


Demonstration 1 IGC data classes

• Using IGC examine installed data classes


Demonstration 1: IGC data classes

Purpose: This demonstration shows how to use IGC to examine data classes. A number of default data classes are installed automatically in IGC.

Task 1. Examine the installed data classes in IGC.

1. Log on to Information Governance Catalog through the IIS Server Launchpad, using student/student.
2. Select the Information Governance Catalog login, using student/student.
3. From the drop down menu, choose Information Assets > Data Classes.
4. The default installed Data Classes are listed in the left pane.
5. Select Country Code to see the right pane View Details populated with the details of the Country Code data class, including its type (in this case, Valid Values).
6. Click the twisty in the Definition box to see details about Country Code, including all the valid values.
7. Examine the other data classes until you have found examples of all three types of data classes: Valid Values, Regex, and Java class.

Results: This demonstration showed you how to use IGC to examine data classes. A number of default data classes are installed automatically in IGC.


Information Analyzer: Data Classification

General Workflow for Data Classification

• Information Governance Catalog (IGC) is the keeper/owner of the Data Classes

1) Examine the Data Classes in IGC, add any new and disable any not needed
2) You must test every single Data Class that is important to you, both positive and negative testing to ensure the Data Classes are doing what you expect with your own data
3) Now run Information Analyzer > Column Analysis. Every data value in every table column will be evaluated against all of the Data Classes (as found in IGC).
4) In Information Analyzer, you can review each Data Class on a column by column basis
5) Once you “Publish Results” from Information Analyzer, then IGC can also see the Data Classes found by Information Analyzer


This is a typical overall flow for Data Classification.

1. The first thing is to examine the data classes in IGC and create any new ones required by the business. Also, enable or disable any existing data classes as necessary for data classification analysis.
2. In addition, it is important to test all data classes in data classification scenarios before putting them into production. The logic needs to be tested to ensure the results match the business requirements. Make sure to test with values that you expect to fail (i.e. test the negatives), and test every class you are using to make sure it meets your needs.
3. Once the data classes are validated, Column Analysis can be run in IA. All the enabled data classes, both in IGC and IA, will be evaluated.
4. A data analyst can now review the results of the classification in IA. The analyst marks the data classes as either valid or invalid.
5. Finally, the data analyst's results are published from within IA. This means the data classes are now available in IGC and to all the components in Information Server. Publish so that everyone else can benefit from the analysis.
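Step 2 (positive and negative testing) can be rehearsed outside the product before a class is put into production. The sketch below is a hedged illustration: `run_class_tests`, the ZIP-code pattern, and the sample values are all hypothetical and are not part of Information Analyzer or IGC.

```python
import re


def run_class_tests(classifier, should_match, should_not_match):
    """Return the values for which the classifier gave the wrong answer."""
    false_negatives = [v for v in should_match if not classifier(v)]
    false_positives = [v for v in should_not_match if classifier(v)]
    return false_negatives, false_positives


# Hypothetical ZIP-code class, kept deliberately simple for illustration.
ZIP_PATTERN = re.compile(r"^\d{5}(-\d{4})?$")


def is_zip(value):
    return isinstance(value, str) and ZIP_PATTERN.match(value) is not None


missed, wrongly_accepted = run_class_tests(
    is_zip,
    should_match=["02134", "90210-1234"],               # positive tests
    should_not_match=["2134", "9021O", "90210-12", ""],  # negative tests
)
```

Both returned lists should be empty before the class is trusted; a non-empty list names exactly the values that need a closer look.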

Note: Steps 3, 4 and 5 are covered in detail in later units. They are included in this current unit for completeness of the data classification and profiling workflow.

Information Analyzer data classes

List of active data classes as defined by the user in IGC


In the IA Project Properties -> Analysis Settings -> Project screen it is possible to see all the data classes available to this project. The user can enable or disable data classes for this project as required.


Column Analysis - Details - Data Class


This shows the Column Analysis -> View Details -> Data Class tab.

• The tab lists all data classes found (a column may match more than one class)
• 'Selected' defaults to the data class with the highest count/percent found above the threshold
• The Data Classes by Value columns show data classes that meet or exceed the thresholds defined for these particular data classes. This means the data class becomes both the inferred and selected data class for that column.
• Selecting any Data Class will show examples of the values on the right
• The example data values for the data class are just that, examples, not all of the data values found.

Once the data analyst has reviewed and validated the data classification results, the data classes can be published and made available to all other components through the IGC.


Information Governance Catalog - data classes (1 of 2)

Multiple ways to view Data Classification in IGC

• Start at Data Class
• Start at Table or Schema
• Custom AdHoc Queries
• Column Level View

• Data Classification View
  − Shows all columns that have that class selected
  − Shows all columns detected by the class

After we have run the data classification and published the results in IA we can turn our attention to IGC. Selecting any Data Class will show all the columns that have that class, along with any other columns that may have been detected as having that Data Class. In this slide we are looking at a data class called Country Code. It tells us that there are 11 columns that have this data class; 2 of which have been selected and 9 of which have been detected.


Information Governance Catalog - data classes (2 of 2)

• Column View
  − Shows all classes for a column
  − Identified candidate, inferred, and selected
  − Shows frequency based confidence

This slide takes a different view by looking at a specific column, in this case COUNTRY_CODE. Selecting any individual column shows all Data Classes for that column. The view also shows a ‘Confidence’ that roughly translates to frequency (e.g. the values in this column matched the Country Code data class 95.89% of the time). Of the four possible data classes, the tool selected Country Code because of its high confidence level.
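The idea of a frequency-based confidence, and of selecting the class with the highest confidence above a threshold, can be sketched as follows. This is a simplified model, not the product's algorithm: the single `threshold` parameter, the classifier functions, and the sample values are assumptions (the text says thresholds are defined per data class).

```python
def classify_column(values, classifiers, threshold=0.75):
    """Return (confidence per class, candidate classes, selected class).

    confidence = fraction of the column's values that match the class.
    threshold is a hypothetical cut-off used only in this sketch.
    """
    total = len(values)
    confidence = {}
    for name, matches in classifiers.items():
        hits = sum(1 for v in values if matches(v))
        confidence[name] = hits / total if total else 0.0
    candidates = [n for n, c in confidence.items() if c >= threshold]
    selected = max(candidates, key=lambda n: confidence[n]) if candidates else None
    return confidence, candidates, selected


# Hypothetical column content and two hypothetical list-of-values classes.
column = ["US", "DE", "FR", "XX"]
classes = {
    "Country Code": lambda v: v in {"US", "DE", "FR"},
    "US State Code": lambda v: v in {"DE"},
}
confidence, candidates, selected = classify_column(column, classes)
```

Here three of the four values match Country Code (confidence 0.75), so it is the only candidate and becomes the selected class; "DE" also matches US State Code, which is why a column can be detected by more than one class.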


Information Governance Catalog - disabling a class

• Disabled classes will not be seen in Information Analyzer


A class can be disabled in IGC so that it does not even show up in Information Analyzer. Here we see the Internet Protocol Address which is enabled as the radio button is set for True. By setting the radio button to False, the data class would be disabled.


Information Governance Catalog - deselecting a class

• A deselected class will default to deselected for all new projects

• But, it CAN be reselected for that project


Additionally, a class can be deselected at a global level so that it is not used in any project. It can also be selectively deselected at a project level while remaining selected at the global level. Conversely, it can be deselected at the global level but selected at the project level. Deselecting classes may help increase performance because there are fewer things to look for.
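One way to picture how disabling and deselecting interact is a lookup in which the global setting is the default and a project may override it. The sketch below is a mental model only; the dictionary layout and the function name are hypothetical and do not reflect how IGC or IA store these settings.

```python
def classes_for_project(catalog, project_overrides):
    """Return {class name: selected?} for one project.

    catalog: {name: {"enabled": bool, "selected": bool}}, the global settings.
    project_overrides: {name: bool}, selections changed in this project.
    Both structures are hypothetical and exist only for this sketch.
    """
    result = {}
    for name, settings in catalog.items():
        if not settings["enabled"]:
            continue  # a disabled class does not show up in the project at all
        # The global selection is the default; the project may override it.
        result[name] = project_overrides.get(name, settings["selected"])
    return result


catalog = {
    "Email Address": {"enabled": True, "selected": True},
    "Internet Protocol Address": {"enabled": False, "selected": True},
    "Country Code": {"enabled": True, "selected": False},
}
project = classes_for_project(
    catalog, {"Country Code": True, "Email Address": False}
)
```

In this example the disabled Internet Protocol Address class is absent, Country Code is reselected for the project despite being deselected globally, and Email Address is deselected for this project only.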


Data Classification Summary

• Customizable, scalable, and secure sensitive data discovery
• Provide InfoSphere Discovery customers a better solution
• Discover sensitive data for IBM InfoSphere Optim and IBM InfoSphere Guardium
• Fully integrated with Information Governance Catalog and the rest of Information Server

Information Analyzer is MORE than JUST profiling - it finds sensitive data and data relationships, and measures ongoing quality!

Information Analyzer is for more than just profiling. In terms of functionality, it has data rules, relationship discovery, and data classification. Beyond that, it is completely integrated with Information Server, which means it is scalable and secure. Add to that integration with Optim and Guardium and you have a world class product that delivers complete functionality to enterprise customers.


Information Analyzer thin client


Information Analyzer thin client (1 of 2)

• New thin client introduced in 11.5 Rollup 1
• A lightweight, browser-based companion to the InfoSphere Information Analyzer workbench
• It can be used together with the IA workbench. Changes made in the IA workbench are visible in the thin client, and vice versa
• It still covers only a subset of the functionality of the IA workbench, but also provides additional capabilities not available in the IA workbench
• Data analysts can execute, view and edit analysis results for data sets and view data quality scores for tables and columns
• Start the thin client from the launchpad or with the following URL:
  https://server:port/ibm/iis/dq/da/


This is a new lightweight, browser-based client which will eventually replace the current locally installed workbench. Currently it covers only a subset of the workbench features, but in the near future it will cover all of them, plus some additional new features.


Information Analyzer thin client (2 of 2)

• Features supported in both the IA workbench and the DA UI (the Data Analyst UI, that is, the thin client):
  − Run/view Column Analysis and classification on relational sources
  − Review and manually confirm or set inferred properties and data classes
  − View/add notes/terms
  − Publish analysis results
  − Display data rule results (data rules cannot be created or run from the IA thin client)
• New features available in the thin client only:
  − Browse/preview/import/analyze files from HDFS (IMAM is still used, but behind the scenes, so it does not have to be invoked by the user)
  − Data quality analysis with computed quality score
  − Advanced search
• Tasks that need to be done in the IA workbench:
  − Configuration, analysis settings, project management
  − Import non HDFS sources (done through IMAM)
  − Key analysis, Cross Domain Analysis, Data rules
  − Scheduling / Sampling
  − Reports


Information Analyzer thin client terminology

• Small differences in terminology between the Data Analyst UI and the workbench:
  − Workspace in the thin client = Project in IA workbench
  − Data Set in the thin client = Table in IA workbench
  − Find Data in the thin client = Browse metadata of imported tables or tables to import in IMAM
  − Add Data Set in the thin client = IMAM meta-data import + registration in IA project
  − Run Analysis in the thin client = Run Column Analysis for all columns of a table in IA, followed by data quality analysis

Note: Run Analysis in the thin client not only runs the column analysis but also runs the data quality analysis, something which is not available in the workbench.

Data Sets

• An HDFS file
• A table
• Imported via IMAM
• Can be non-HDFS flat file using the ODBC connector
• Can be non-HDFS flat file using the "File Connector - Engine tier" connector
• Can be Hive (using the Hive ODBC driver)


Information Analyzer thin client - Advanced Search

The thin client implements a search function with greater capabilities than the current workbench. Search is powered by Solr.

• Search by general keywords (looks in table names, descriptions, columns, column descriptions, terms, classes, etc.)
• Faceted search by:
  − Data Set type (flat file vs table)
  − Number of columns
  − Data Classes
  − Data Quality
  − Analysis state
  − Etc.


Data Quality Score

• Evaluated at a column and data set level
• Evaluated based on violation of 8 data quality dimensions plus data rules
  − Data Class: 'FL' in a column classified as Credit Card
  − Data type: '3.14159' in a column with Decimal(4,2) data type
  − Format: Based on marking a particular format ‘invalid’ in the interface
  − Minimum/Maximum value: Based on user defined minimum and maximum values
  − Missing values: Missing means either empty or null
  − Inconsistent missing value representation: A column contains both 'Null' and 'empty value' representations of missing values
  − Suspect values: Applies to columns with no data class: 'MA' in a column where most other data is numeric
  − Uniqueness (Duplicate values): When more than 95% and less than 100% of the values are unique
  − Rule Violations: Uses 'Percentage not met' (even if you select 'output all records')


When a data set is analyzed in the thin client, column analysis is run on all its columns, followed by data quality analysis, which searches for potential data quality issues and computes a unified data quality score. The inferred properties obtained from the column analysis, as well as the metadata provided by the user in the thin client or in IGC, are used to compute a quality score for the whole data set as well as for each column. The column-level scores are averaged to give the score for the data set. The score is based on the 8 dimensions of data quality plus the data rules. The eight dimensions are documented at: http://www.ibm.com/support/knowledgecenter/#!/SSZJPZ_11.5.0/com.ibm.swg.im.iis.ia.product.doc/topics/r_quality_indicators.html


Data Quality Score Example


Here is an example of how column and data set data quality scores are computed. Note the column names and data values. In this example, we use three colors to indicate data quality issues: green for suspect values, yellow for missing values, and purple for duplicate values.

Looking at the Name column, we see two missing values and one suspect value (it is a number when all others are text). That is 3 data quality violations in 10 rows, so the score is 70%. Note that there is no concept of 'scale': all data quality violations are treated the same. A missing value detracts from the data quality score just as much as a suspect value.

For the Address column, we see three missing values and two suspect values. This means 5 out of 10 rows, or 50%, have issues, so the score is 50%.

For the Phone column, there are two duplicate values and two missing values, so the score is 60%.

The data set data quality score is computed as the average of all the column data quality scores. In this case, the average is 60%.
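The arithmetic in this example can be reproduced with a short Python sketch. It assumes each row carries at most one counted violation per column, which matches the way the example counts; the function names and the per-row flags are hypothetical, since the slide image itself is not reproduced here.

```python
def column_score(row_has_violation):
    """Percentage of rows without a violation; one boolean per row."""
    rows = len(row_has_violation)
    if rows == 0:
        return 100.0
    bad = sum(1 for flagged in row_has_violation if flagged)
    return 100.0 * (rows - bad) / rows


def data_set_score(column_scores):
    """The data set score is the plain average of the column scores."""
    return sum(column_scores) / len(column_scores)


# Ten rows per column; True marks a row with a violation of any kind.
name_flags = [True] * 3 + [False] * 7     # 2 missing + 1 suspect
address_flags = [True] * 5 + [False] * 5  # 3 missing + 2 suspect
phone_flags = [True] * 4 + [False] * 6    # 2 duplicate + 2 missing

scores = [column_score(f) for f in (name_flags, address_flags, phone_flags)]
overall = data_set_score(scores)
```

The three column scores come out as 70, 50 and 60, and their average is 60, as in the text. Because every violation type is counted the same way, the kind of violation never enters the calculation.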


Checkpoint

1. True or False? Creating classifications gives you the ability to group data objects.
2. True or False? Terms can be entered in either Information Analyzer or Information Governance Catalog.
3. True or False? Categories containing sub-categories must be entered in Information Governance Catalog, not Information Analyzer.


Checkpoint solutions

1. True or False? Creating classifications gives you the ability to group data objects.
   True
2. True or False? Terms can be entered in either Information Analyzer or Information Governance Catalog.
   True
3. True or False? Categories containing sub-categories must be entered in Information Governance Catalog, not Information Analyzer.
   True


Demonstration 2 Familiarization with IA thin client

• Work with the IA thin client features


Demonstration 2: Familiarization with IA thin client

Purpose: Work with the new Information Analyzer thin client.

Task 1. Explore IA thin client.

1. Log on to Information Server using the IIS Server Launchpad.
2. Select Information Analyzer using student/student.
   The thin client will show all existing Information Analyzer Thick Client projects.
3. Use Ctrl+ and Ctrl- to resize the cards. Press Ctrl+0 when done to reset to 100%.
4. Select the Find data tab at the top of the screen. You will see all the metadata imported via IMAM and used in any current Information Analyzer projects.
5. Click the Sort icon to see the ways you can sort the data sets.
6. Click the Search icon to see the ways you can search the data sets.
7. Examine the list of search options.
8. Search for the keyword 'ord' (without the quotes) by typing text where it says Type text.
9. Look at the names of the data set(s) returned. Do they have 'ord' in the file name? Search searches file names, descriptions, and column names. The upper left of the screen tells you that you are looking at a subset of your data sets.
10. Clear the search by clicking the red x.
    A filter will show only data sets with any selected data class (for example, 'email address'). To use filters, bring up the search pane as previously.
11. Under Filters, expand Selected data class and uncheck Select all to clear all check boxes.
12. Check Code and then click Apply Filter.
13. You should see 4 data sets now that the filter has been applied.
14. Now apply an additional filter for 'Found data class' of Date. Apply this filter in the same way as for 'Selected data class'.
15. How many data sets do you see now? Multiple filters are an 'and' condition.
    The search pane may cover the right hand side of the data sets. Close this by clicking the x in the search pane.
16. Clear the filters by clicking the red x or selecting clear.
17. Close the search pane (if necessary).

Results: You worked with the new Information Analyzer thin client.
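The search behaviour seen in this demonstration (a keyword is looked up in names, descriptions and column names, and multiple filters combine as an 'and' condition) can be mimicked with a small Python sketch. The real thin client delegates search to Solr; the function, the case-insensitive matching, and the sample data sets below are assumptions made for illustration.

```python
def search(data_sets, keyword=None, selected_class=None, found_class=None):
    """Return names of data sets that satisfy every supplied condition."""
    results = []
    for ds in data_sets:
        if keyword is not None:
            # The keyword may appear in the name, the description, or a column name.
            searchable = [ds["name"], ds["description"]] + ds["columns"]
            if not any(keyword.lower() in text.lower() for text in searchable):
                continue
        if selected_class is not None and selected_class not in ds["selected_classes"]:
            continue
        if found_class is not None and found_class not in ds["found_classes"]:
            continue
        results.append(ds["name"])
    return results


# Hypothetical data sets, invented for this example.
DATA_SETS = [
    {"name": "ORDERS", "description": "", "columns": ["ORDER_ID", "ORDER_DATE"],
     "selected_classes": {"Code"}, "found_classes": {"Code", "Date"}},
    {"name": "CUSTOMER", "description": "customer master", "columns": ["CUST_ID", "COUNTRY"],
     "selected_classes": {"Code"}, "found_classes": {"Code"}},
    {"name": "ITEM", "description": "", "columns": ["ITEM_ID", "ORDER_ID"],
     "selected_classes": set(), "found_classes": set()},
]

by_keyword = search(DATA_SETS, keyword="ord")
by_one_filter = search(DATA_SETS, selected_class="Code")
by_two_filters = search(DATA_SETS, selected_class="Code", found_class="Date")
```

ITEM is returned for the keyword 'ord' even though its name does not contain it, because one of its columns does. Adding the second filter narrows the result, since every condition must hold.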


Unit summary

• Understand the relationship between categories and terms in Information Governance Catalog and Information Analyzer

• Link terms to data objects
• Create data definitions


Column Analysis


Information Analyzer v11.5 © Copyright IBM Corporation 2016 Course materials may not be reproduced in whole or in part without the written permission of IBM.

Unit 6 Column Analysis

Unit objectives

• Perform column analysis


Understand the business problem

• Column Analysis examines structure and content, which allows us to infer the actual structure of the data and compare it with the defined structure.
• Defined metadata - entered into Information Analyzer by the import metadata process
• Inferred metadata - results of an Information Analyzer content analysis task


Understanding the business problem should drive most of your analysis tasks. The objective of Column Analysis is to help understand the structure and content of data together. Looking only at one dimension of this problem is not enough. Why do you need to understand both structure and content? If you looked only at the structure of the data (metadata), then the metadata itself would have to be precise and convey the intent of each column. How often do you come across precise metadata definitions? If you looked only at the content of the data, you could potentially uncover issues in the content, but you also need to understand the domain of the column to see the complete picture.


Column Analysis overview

• Occurs in the context of a project:
  − Allows the same column to be analyzed multiple times in different projects.
  − References metadata in the shared repository, does not copy the metadata down into the project. Allows for the comparison of metadata over time (Baseline Analysis).
• Creates a base profile which stores both structure and content analysis:
  − Infers from content the column's Data Classification (Identifier, Code, and so on).
  − Infers from content the column's Properties (Data Type, Length, and so on).
  − Generates Frequency Distribution for all values.
  − Generates General Format for all values.
• User Review:
  − Allows the user to drill down into the results of the Base Profile and accept or alter the inferences.
  − Allows the user to perform domain and completeness review on the resultant frequency distribution.
  − Allows the user to perform General Format review.


The base profile is originally built by a column analysis job. This base profile will contain information such as data values encountered, assignment of data classes, and inferences (Information Analyzer’s conclusions about data properties based on column content). After the Information Analyzer job performs the data profiling tasks, users and data analysts review the results and can make modifications.
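Two ingredients of the base profile, the frequency distribution and the general format, can be illustrated with a minimal Python sketch. The symbols used for the format ('A' for an uppercase letter, 'a' for a lowercase letter, '9' for a digit) and the sample values are assumptions chosen for the example, and the function names are hypothetical.

```python
from collections import Counter


def frequency_distribution(values):
    """Count how often each distinct value occurs in a column."""
    return Counter(values)


def general_format(value):
    """Reduce a value to its shape: letters and digits become symbols,
    every other character is kept as it is."""
    symbols = []
    for ch in str(value):
        if ch.isdigit():
            symbols.append("9")
        elif ch.isalpha():
            symbols.append("A" if ch.isupper() else "a")
        else:
            symbols.append(ch)
    return "".join(symbols)


column = ["AB-1234", "CD-5678", "AB-1234", "ef 90"]  # hypothetical column values
value_counts = frequency_distribution(column)
format_counts = Counter(general_format(v) for v in column)
```

Reviewing `format_counts` rather than the raw values makes the odd one out ("aa 99") easy to spot, which is the purpose of a general format review.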


What does Column Analysis do?

• Examines the data values themselves
• Infers the true physical characteristics
  − Data types, precision, scale, null, distinct, constant
• Highlights attribute settings that could be
• Looks at every value in every column
• Can prepare a data sample
  − Sequential, nth Record, Random (Analysis Setting)
• Prepares a distribution file:
  − Distinct values
  − Frequencies
• Allows analyst to choose actual characteristics that were found

Column analysis

© Copyright IBM Corporation 2016

What does Column Analysis do?

In Column Analysis, the source data is examined and Information Analyzer observes and records the data properties found in the actual data values. These include data attributes such as data type, precision, scale, and distinct values. Column Analysis also prepares several files that will be used in subsequent analysis processes. Once the analysis process has run, it becomes the data analyst's responsibility to review and accept or change the data characteristics that were found. This is known as the “Review” process.
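The three sample types named on the slide (Sequential, nth Record, Random) can be sketched generically in Python. The course material does not describe how Information Analyzer implements them, so the functions below are only one plausible reading: the first N rows, every nth row, and a random draw. All function names and the `seed` parameter are assumptions for illustration.

```python
import random


def sequential_sample(rows, size):
    """Take the first `size` rows in source order."""
    return rows[:size]


def nth_record_sample(rows, n):
    """Take every nth row: the nth, the 2nth, and so on."""
    return rows[n - 1::n]


def random_sample(rows, size, seed=0):
    """Take `size` rows at random; a fixed seed keeps the sketch repeatable."""
    rng = random.Random(seed)
    return rng.sample(rows, min(size, len(rows)))


rows = list(range(1, 11))
first_three = sequential_sample(rows, 3)
every_third = nth_record_sample(rows, 3)
four_random = random_sample(rows, 4)
```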


Why is this important?

• Presents the big picture to the analyst
• Presents only the facts about your data
• Gives the analyst a tool for communicating business questions and decisions

Column Analysis saves a great deal of time. Analyzing all of your data manually is extremely difficult, and most projects allot limited time to the data discovery step, usually not enough. Information Analyzer lets the analyst document, or “take notes”, throughout the review process, which is a great tool for collaborating with other business users. When reviewing the Information Analyzer results, the analyst can accept what Information Analyzer has inferred or choose different attributes.


Structural integrity

• Data Definition analysis:
  − Data Type (for example, Integer, VarChar, and so on)
  − Data Precision (for example, field length)
  − Data Scale (for example, definition of any decimal positions)
• Data Type consistency:
  − Multiple Data Types within single field:
    − Similar Types (for example, Tinyint, Smallint, Integer)
    − Conflicting Types (for example, Integer versus Char)
• Purpose:
  − Conformity to expected data formats
  − Conformity to metadata (for example, the data values are consistent with the understanding of the data usage or data rules)

Structural integrity is one of the key validations resulting from Column Analysis. Structural integrity addresses the consistency of data type within the actual data. Analysis should occur against defined metadata, as well as based on the actual information. Conflicting data types:
• May indicate the presence of unexpected data from keying errors (for example, an alpha character in a US Social Security number)
• May indicate diverse data conditions (for example, foreign and domestic postal codes)
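A rough way to see conflicting data types in a single field is to infer a type for each value and count the results. The sketch below is an illustration only, not Information Analyzer's inference logic; the type names returned by the hypothetical `infer_type` function are assumptions, and Python's own parsing rules decide what counts as an integer or a decimal.

```python
from collections import Counter


def infer_type(value: str) -> str:
    """Infer a coarse type for one value (labels are illustrative)."""
    text = value.strip()
    if text == "":
        return "Unknown"  # blank values carry no type information
    try:
        int(text)
        return "Integer"
    except ValueError:
        pass
    try:
        float(text)
        return "Decimal"
    except ValueError:
        return "String"


def type_distribution(values):
    """Count how many values fall under each inferred type."""
    return Counter(infer_type(v) for v in values)


# A field expected to hold nine-digit numbers, with one keying error and one blank.
ssn_like = ["123456789", "987654321", "12345678A", ""]
dist = type_distribution(ssn_like)
```

More than one inferred type in the distribution is the signal to investigate: here the single String value points to a keying error.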


Domain integrity

• Data Value or Frequency analysis:
  − Null or Missing Values
  − Most/Least Frequent Values
  − Uniqueness
  − Constant Values
• Purpose:
  − Conformity to expected data values
  − Scope/range of data values

Domain integrity is one of the key validations resulting from Column Analysis. Review of the domain values should include identifying:
• Null or missing values (depending on the type of information, missing or null values may compromise column usage).
• Most/Least Frequent Values (default values usually occur with greater than normal frequency; anomalous data usually occurs at the low end of frequency).
• Uniqueness (unique data elements are potential primary key candidates).
• Constant Values (constant values may represent a specific data condition that needs to be addressed, or simply a case where no data with any other value has occurred yet; an example of the latter is a constant Country Code of ‘US’ where no foreign data exists at present).
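The four domain checks listed above can be approximated from a frequency distribution. The following Python sketch is illustrative only; the function name `domain_summary` and the dictionary keys are assumptions, and blanks are treated as missing values by choice.

```python
from collections import Counter


def domain_summary(values):
    """Summarize nulls, most/least frequent values, uniqueness, and constancy."""
    present = [v for v in values if v is not None and v.strip() != ""]
    frequency = Counter(present)
    ranked = frequency.most_common()  # highest frequency first
    return {
        "null_or_missing": len(values) - len(present),
        "most_frequent": ranked[0] if ranked else None,
        "least_frequent": ranked[-1] if ranked else None,
        "is_unique": bool(present) and len(frequency) == len(present),
        "is_constant": len(frequency) == 1,
    }


# A Country Code column where only domestic data exists so far.
summary = domain_summary(["US", "US", "US", None, " "])
```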


Domain integrity: Do you know what the field contains?

• Purpose:
  − Conformity to expected data values
  − Scope/range of data values
• Data Value or Frequency analysis:
  − Null or Missing Values
  − Most/Least Frequent Values
  − Uniqueness
  − Constant Values
• Formats:
  − Frequency of formats
• Deliverables:
  − Reports for analysis and review by Subject Matter Experts

For Domain Integrity, each field (or domain) is checked for frequency of data occurrence, completeness, and validity. Based on validation with Subject Matter Experts, additional tests against data rules can be made. Domains include Code fields, Identifiers, Dates, Quantities, Indicators, and Text fields.


Domain integrity: What to look for?

• Identify what is relevant:
  − Weed out the extraneous:
    − If there is nothing there, it is either irrelevant or a gap
    − Look for data classification of Unknown or a single constant value at summary level
• Identify what needs further exploration:
  − Take advantage of classifications:
    − Identifiers, Indicators, Codes, Dates & Timestamps, Quantities, and Text
• Annotate and report what is anomalous
• Mark Reviewed what is done

The presence of null values frequently shows up as a data classification labeled “Unknown”. How you interpret this situation depends on your expectations: is this a data anomaly or a normal situation? You can explain your conclusions by creating notes. These notes can be attached to either the table or the column. When you have finished your analysis, select the Reviewed check box and then save.


Analysis process

• Modify Analysis Settings, if needed
• Select tables or columns to analyze
• Execute Column Analysis
• Review the results:
  − Understand core issues of Structural and Domain Integrity
  − Document decisions with Notes
  − Select attributes for target schema
• Iterate the process as necessary, with any Analysis Settings changes

Column Analysis is the first of many analysis processes that will be run. Regardless of which analysis job you want to run the general process is the same. First, you should review the configuration options related to the job you want to execute and set the appropriate thresholds. Second, select the tables you want to analyze and the server on which you want the process to run. Then, start the process and when it completes, verify that it completed without any errors. Now you are ready to review the results. You will open an Analysis Review window to assess the results, identify and document your business decisions, and determine the data attributes you wish to carry forward for your target database schema.


Column Analysis: Step by step

• Invoke Column Analysis in several ways:
  − From the Investigate Pillar, choose the Column Analysis Task
  − From the Project Dashboard Getting Started Panel, select Analyze Columns
  − From the Project Dashboard Analysis tab, select Column Analysis

Column analysis will produce the base profile for a particular column. This process must take place within the context of an open project.


Column Analysis: Run Column Analysis

• Invoking Column Analysis:
  − Click Investigate Pillar menu
  − Display list of registered data sources and the analysis status
  − Select Run Column Analysis in Task pane

[Screenshot callouts: Data Source Drilldown, Task List]

From the Investigate pillar choose the column analysis option, select the objects upon which you want to perform column analysis, and then click the Run Column Analysis option located under the Task pane.


Demonstration 1 Column Analysis

• Run Column Analysis on tables
• Review results

Demonstration 1: Column Analysis

Purpose: This demonstration shows how to perform Information Analyzer column analysis. Column analysis examines data content at the column level within a record. This analysis is the first step in understanding your source data and will frequently reveal problems with data quality.

Task 1. Run Column Analysis for Customer and Vendor tables.

1. Log on to Information Analyzer using the Information Server Console using student/student.
2. Select the ChemCo project from the Projects list. Recall that analysis functions are performed in the context of a project.
3. Double-click the ChemCo project to open it.
4. Notice the tabs. The Dashboard tab is on the top with Details, Analysis, and Quality tabs underneath.
5. Click the Analysis tab. This tab lists the data that is registered to your project and summarizes the progress of your data profiling effort. (There is not much to show yet.)
6. From the Pillar menus bar, click Investigate > Column Analysis.
7. Expand the Seq data source down to the files and then select Customer.txt.
8. Click the Run Column Analysis option under the Task list located in the upper-right portion of the window.


9. On the right-hand portion of the screen, verify that the Run Now radio button is on. (Do not click the Sample tab; you will learn more about this option later.)

10. Near the bottom right-hand portion of the window, use the drop-down menu to click Submit and then click Submit again.

11. Place the cursor near the bottom of the window until a pop-up screen appears.

12. Click the Details button to view job run statistics. An ActivityStatus panel will appear.

If an error occurs, you will be notified in the Status column. You would then research the source of the error, fix the problem, and then rerun the job. When the job completes, the Status column will display Schedule Complete and a Summary panel will appear on the right-hand side that displays details for the job run when the job is selected.


13. Click the Close button on the Summary panel. Note the column statuses in the CUSTOMER.txt table are now set to Analyzed.

14. Run Column Analysis on ALL the remaining tables using the same steps as the Customer table.

Task 2. Review Column Analysis for the Vendor table.

1. Click Investigate > Column Analysis.
2. In the Column Analysis tab, right-click the VENDOR table and click the Open Column Analysis option, or click the Open Column Analysis option in the Tasks list.
3. Take a few moments to review the information displayed on the View Analysis Summary panel. Note the red flags in the first of the detail columns. These flags indicate that the inferred properties for a column, as determined by Information Analyzer, differ from the formally declared column definitions (metadata from the Metadata import).

Note: If the View Analysis panel indicates that only 1 record was read from your vendor file, recall that you set a Where clause in a previous demonstration that created the condition: VENDORCODE = ASCO. If you forgot to remove the Where clause condition, then you will only see one record from this screen. If this is the case, go back to the previous demonstration instructions and remove the Where clause from the analysis settings and then rerun column analysis for the VENDOR table.


For each column, the View Analysis Summary screen shows:
• Totals: rows, columns
• Cardinality
• Data Class
• Data Type
• Length
• Precision
• Scale
• Nullability
• Cardinality Type
• Format
• Review Status

>> and > and 15 and Cardinality < Threshold (usually 98%)


Assess identifiers

Identifiers will be primary key candidates in a later analysis process. Identifiers can be used in record linkage by association with foreign keys on related tables.


Unit 7 Data profiling techniques

Review identifier properties

• Consistent metadata:
  − Data Type:
    − Numeric or Character
  − Length: Do not waste space!
    − Excessive length definition always has an impact on storage and processing
    − Select what is needed
  − Cardinality Type: ensure uniqueness is understood

Ideally, an identifier column will have very consistent metadata in both data type and data length.


Review identifier domain values and formats

• Nulls and duplicates:
  − Missing values prevent linkage of data
  − Duplicate values create incorrect linkages of data
  − Review Frequency Distribution and Domain Analysis
• Invalid format and value out-of-range:
  − Both conditions may prevent correct handling of data
  − Review Format Analysis and Domain Analysis
  − Check Quintile Analysis for Low-end/High-end values

Identifiers present some issues:
• Data values should be unique.
• No nulls are allowed.
• If the source data does not conform to these two requirements, then the anomalies should be investigated and documented.
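The two identifier requirements above (unique values, no nulls) are easy to test outside the tool as well. The sketch below is a hypothetical Python check, not an Information Analyzer feature; the function name `identifier_issues` and the way uniqueness is expressed (distinct non-null values divided by total rows) are assumptions for illustration.

```python
from collections import Counter


def identifier_issues(values):
    """Report nulls, duplicated values, and a simple uniqueness ratio."""
    present = [v for v in values if v is not None and str(v).strip() != ""]
    counts = Counter(present)
    duplicates = sorted(v for v, n in counts.items() if n > 1)
    uniqueness = len(counts) / len(values) if values else 0.0
    return {
        "nulls": len(values) - len(present),
        "duplicates": duplicates,
        "uniqueness": uniqueness,
    }


issues = identifier_issues(["V001", "V002", "V002", None, "V004"])
```

Any null or duplicate reported here would be a candidate for a note and further investigation before the column is treated as a key.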


Verify indicators

• Indicators:
  − Binary values (M/F, True/False, Yes/No, 0/1)
  − Most are Flags:
    − Often trigger actions elsewhere
    − Sometimes set conditional situations (for example, only females have obstetric procedures)
• Note: Not all indicators are classified correctly:
  − Inferred as Code if # of Distinct Values > 2
  − Review naming - FLAG in metadata is a clue
  − Reset classification as needed

Indicators have a Cardinality=2. If nulls, spaces, or invalid values are also present the Indicator may be incorrectly classified as a code.
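The effect of nulls and spaces on classification can be shown with a simplified rule. The sketch below assumes only what the slide states: two distinct values suggest an Indicator, and more than two suggest a Code. The function names, the `ignore_blanks` switch, and the "Unknown" fallback are hypothetical; the real classification logic uses settings not covered here.

```python
def distinct_count(values, ignore_blanks=False):
    """Count distinct values, optionally leaving out nulls and spaces."""
    if ignore_blanks:
        values = [v for v in values if v is not None and v.strip() != ""]
    return len(set(values))


def inferred_class(values, ignore_blanks=False):
    """Simplified rule: 2 distinct values -> Indicator, more than 2 -> Code."""
    n = distinct_count(values, ignore_blanks)
    if n == 2:
        return "Indicator"
    if n > 2:
        return "Code"
    return "Unknown"  # fallback label chosen for this sketch only


flags = ["Y", "N", "Y", " ", None]
as_profiled = inferred_class(flags)
after_cleanup = inferred_class(flags, ignore_blanks=True)
```

With the blank and the null counted as values, the column has four distinct values and looks like a Code; once they are set aside it has two and looks like an Indicator, which is when the analyst would reset the classification.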


Review indicator properties

• Consistent metadata:
  − Data Type:
    − Usually String or Integer
  − Length: concise versus clear
    − Excessive length definition has an impact on storage and processing
    − Indicate reductions, but keep understandable
    − How much space needed to be clear and accurate? (Male/Female versus M/F)
  − Cardinality: Ensure constraint is understood
    − Neither Unique nor Constant

Indicators, like identifiers, should have consistent metadata.


Nulls and blanks in indicators

• Presence of nulls:
  − When are null values beneficial?
  − More likely to impact correct system behavior - failure to trigger events
  − Document and report anomalies

The presence of nulls and blanks in indicators could be a normal condition; that is, they do not trigger further action from the application system. However, this will depend on how the application was designed. If you see anomalies, use the notes feature of Information Analyzer to document them.


Skewing of indicator values

• Presence of skewed Indicators:
  − When are skewed values expected?
    − Flags represent occasional domain events
      • Active should be > Inactive
  − Where are skewed values not expected?
    − Flags represent equally distributed populations
      • Male/Female should be roughly equal
• Document and report anomalies

Data skew is identified as unequal distribution of data values. This may or may not be an expected condition.
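One simple way to quantify skew is to compare each value's share of the rows with an equal split. The Python sketch below is an assumption-laden illustration: the `tolerance` of 10 percentage points is arbitrary, and whether a skewed result is a problem still depends on the business meaning of the column.

```python
from collections import Counter


def value_shares(values):
    """Return each distinct value's share of the total row count."""
    frequency = Counter(values)
    total = sum(frequency.values())
    return {value: count / total for value, count in frequency.items()}


def is_skewed(values, tolerance=0.10):
    """True if any share differs from an equal split by more than `tolerance`."""
    shares = value_shares(values)
    if not shares:
        return False
    expected = 1 / len(shares)
    return any(abs(share - expected) > tolerance for share in shares.values())


gender = ["M"] * 48 + ["F"] * 52              # expected to be roughly equal
status = ["Active"] * 90 + ["Inactive"] * 10  # skew is expected here

gender_skewed = is_skewed(gender)
status_skewed = is_skewed(status)
```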


Find and document indicator issues

• Understandability:
  − Gender => which is understood? M/F or 0/1?
  − Add/Save definition to facilitate understanding.
• Accuracy:
  − Where do you verify? Need to identify relevant cross-reference.
• Consistency:
  − Migrations, mergers, transformations introduce multiple representations.

Adding understandability to a column can be done by using terms from Information Analyzer or Business Glossary. Once the term has been created it can be attached to any data column or data table. More will be covered later in this course on this topic.


Validate codes (1 of 2)

• Codes:
  − Finite set of values (most < 100 values)
  − Represent:
    − States of action (Ordered, Cancelled, Shipped, and so on)
    − Check digits for other fields (0-9)
    − Shorthand for a reference (Zip Code = specific location)
  − Understandable:
    − Status Codes: A, I, X, D => What do values mean?
  − Concise:
    − How much space needed to be clear and accurate?
    − Are there value overlaps?

The codes shown in this table represent decode values for a reference table.


Validate codes (2 of 2)

• Codes:
  − Valid and accurate:
    − What is the master reference?
  − Consistent:
    − Standardized set of values (for example, Country Code - ISO 2 or 3-digit)
    − Migrations, mergers, transformations impact

If you have a master reference table that has been loaded into the source data analysis area you can reference this table in the domain and completeness tab. All values not present on the reference table will be flagged as invalid.
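The reference-table check amounts to a set membership test. The following Python sketch shows the idea with a made-up country code list; the function name `invalid_values` and the sample data are assumptions, and the sketch does not reproduce the Domain and Completeness tab itself.

```python
def invalid_values(column_values, reference_values):
    """Return the distinct column values that are absent from the reference."""
    valid = set(reference_values)
    return sorted({v for v in column_values if v not in valid})


# Hypothetical master reference and source column.
country_reference = ["US", "CA", "GB"]
flagged = invalid_values(["US", "UK", "CA", "XX", "US"], country_reference)
```

Here `UK` is flagged because the reference uses `GB`, a typical consistency issue after a migration or merger.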


Review code properties

• Consistent metadata:
  − Data Type: Constant
    − Usually String or Integer
  − Length: concise versus clear:
    − Excessive length definition has an impact on storage and processing
    − Indicate reductions, but keep understandable
    − How much space needed to be clear and accurate? (United Kingdom versus UK)
  − Cardinality: Ensure constraint is understood:
    − Neither Unique nor Constant
    − High Constant value suggests that domain is rarely utilized

By their very nature codes are meant to be short versions of something larger. Therefore, codes should be short and concise. On the other hand, the shorter the code the less descriptive it is.


Nulls and blanks in codes

• Presence of nulls:
  − When are Null values beneficial?
  − More likely to impact correct system behavior - failure to trigger events or set proper state
• Document and report anomalies

Nulls in a code column frequently indicate a data mistake. They should be investigated and documented.


Skewing of code values

• Presence of skewed codes:
  − When are Skewed values expected?
    − Codes represent common versus occasional conditions
      • Ordered and Completed should be substantially greater than Cancelled.
  − Where are Skewed values not expected?
    − Codes represent equally distributed populations
      • Geographic codes
• Document and report anomalies

Data skew can be seen from a couple of places in Information Analyzer – the chart view on the Frequency Distribution tab or the Show Quintiles button on the Domain and Completeness tab.


Find and document code issues

• Understandability:
  − Order Status => which is best understood? Ordered/Shipped/Cancelled/Completed or O/S/X/C?
  − Add/Save Definition to facilitate understanding.
• Accuracy:
  − Where do you verify? Need to identify relevant cross-reference or master reference.
• Consistency:
  − Migrations, mergers, transformations introduce multiple representations.

Codes can be verified by Information Analyzer by using the reference table option on the Domain and Completeness tab.


Assess quantities (1 of 2)

• Quantities:
  − Potentially infinite set of numeric values (integers, decimals, floating values; positive or negative):
    − Quantities, prices, currency values
  − Are these externally entered or calculated?
    − Any defaults?
  − Valid:
    − What is the acceptable range of values?
    − What are the outliers?
  − Accurate:
    − Price-type values may work similarly to codes
    − Can be assessed through data rules (equations, aggregations)

If the quantity is a value that is calculated from existing columns you can use Information Analyzer data rules to verify validity. Use the range function available on the Domain and Completeness tab to help specify minimum and maximum values for a column.
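Both checks described above, a minimum/maximum range and a rule that recomputes a calculated quantity, can be sketched in a few lines of Python. This does not use Information Analyzer's data rule syntax; the column names `quantity`, `unit_price`, and `total`, the rule itself, and the rounding tolerance are all hypothetical.

```python
def out_of_range(values, minimum, maximum):
    """Return the values that fall outside the accepted range."""
    return [v for v in values if v < minimum or v > maximum]


def rule_violations(rows, tolerance=0.005):
    """Hypothetical rule: total must equal quantity * unit_price."""
    return [
        row for row in rows
        if abs(row["total"] - row["quantity"] * row["unit_price"]) > tolerance
    ]


outliers = out_of_range([5, 120, -3, 40], minimum=0, maximum=100)

orders = [
    {"quantity": 2, "unit_price": 5.0, "total": 10.0},
    {"quantity": 3, "unit_price": 4.0, "total": 11.0},  # should be 12.0
]
violations = rule_violations(orders)
```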


Assess quantities (2 of 2)

• Note: Not all Quantities are classified correctly:
  − Inferred as Code if # of distinct values is low
  − Inferred as Text if source is flat file, # of distinct values is high, and presence of nulls/spaces impacts classification
  − Review naming - VAL, QTY, PRC in metadata is a clue
  − Reset classification as needed

The example depicted on the slide shows that the data class for the QTYORD column has been assigned as Code; this is incorrect and occurred because the cardinality is low. In this case, the data class should be overridden to Quantity.


Review quantity properties (1 of 2)

• Consistent metadata:
  − Data Type: Numeric:
    − Integer, Decimal, Float
    − Review consistency of representation
    − Flat file sources may be seen as character or string data instead of numeric
    − Watch for unknown data type: signals presence of null or space values

Quantities can be represented by several data types; use the appropriate properties tab to select your choice. Unknown data types usually indicate the presence of nulls.


Review quantity properties (2 of 2)

• Consistent metadata:
  − Precision: Total numeric length
    − Numeric data identified as String or Character data type will show Length not Precision
    − Review defined precision length versus utilized
  − Scale: Decimal length
    − Numeric data identified as String or Character data type will show no Scale value
    − Review defined scale versus utilized

Quantities can assume a wide range of values and usually have high cardinality.
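To compare defined precision and scale with what the data actually uses, the utilized values can be derived from the numbers themselves. The sketch below uses Python's `decimal` module and assumes finite numeric literals held as text; the function names are hypothetical, and the result should be read as an estimate, not as the figure Information Analyzer reports.

```python
from decimal import Decimal


def precision_and_scale(text):
    """Return (precision, scale) for one finite decimal literal such as '123.45'."""
    _sign, digits, exponent = Decimal(text).as_tuple()
    if exponent >= 0:
        return len(digits) + exponent, 0  # whole number, no decimal positions
    scale = -exponent
    return max(len(digits), scale), scale


def utilized_precision_scale(values):
    """Smallest (precision, scale) that can hold every value in the column."""
    pairs = [precision_and_scale(v) for v in values]
    max_scale = max((s for _, s in pairs), default=0)
    max_integer_digits = max((p - s for p, s in pairs), default=0)
    return max_integer_digits + max_scale, max_scale


utilized = utilized_precision_scale(["12.5", "1234.25", "7"])
```

If the column were defined as, say, DECIMAL(15,4), a utilized result of (6, 2) would prompt a review of whether the defined length is excessive.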


Nulls, spaces and zeroes in quantities

• Presence of nulls and spaces:
  − Impacts inferencing of data classification
  − Might impact correct system behavior - failure to be correctly reported
• Presence of zeroes:
  − Numeric but is it valid?
  − If incorrect, likely to impact calculations in other quantities
• Document and report anomalies

Quantities represent values that are usually numeric and, therefore, should be represented by a significant value. Sometimes a zero is a significant value and sometimes it is not. Nulls and spaces can represent problems in the data that should be documented and reported to the project team.


Skewing of quantity values

• Presence of skewed Quantities:
  − When are Skewed values expected?
    − Common versus occasional conditions:
      • Most individual orders are small; institutional orders might be high - but are rare
      • Most salaries are within a typical range, but outliers not unexpected
  − Where are skewed values not expected?
    − Quantities that represent standard rates or fairly constant values:
      • Shipping charges
      • Tax rates
• Document and report anomalies

Skewed values are frequently encountered in certain types of numerical fields. This skewing should be reviewed by a subject matter expert and, if identified as a problem, documented.


Find and document quantity issues

• Accuracy:
  − What is reasonable?
    − Review distribution
    − Negative values - Should there be any?
      • Asset values, prices, salaries are not negative
      • Sales may include returns, which would be represented as negative
    − High values - What is the maximum allowed?
  − Where do you verify? Need to identify relevant documentation of valid range.
• Consistency:
  − Calculations might introduce inconsistent values

Note that when viewing a quantity column, the Domain type default is set to Range. Values that are either very high or very low will be reflected in the Outliers column.


Analyze dates

• Dates:
  − Generally bounded set of calendar dates or timestamps
  − Are these externally entered or calculated?
    − Any defaults?
  − Valid:
    − What is the acceptable range of values?
    − What are the outliers?
  − Accurate:
    − Different characteristics for different dates in differing situations

True date fields can be incorrectly classified as string in the presence of nulls.


Review date properties

• Consistent metadata:
  − Data Type: Date
    − Review consistency of representation
    − Flat file sources might be seen as character or string data instead of date
    − Watch for unknown data type - signals presence of null or space values
  − Length:
    − Standard length for date = 8
    − Watch for inconsistent lengths
  − Format:
    − Common problem with dates
      • Multiple representations
    − Validate consistency of format usage

In addition to inconsistent lengths dates can have inconsistent formats. For instance, some data might have the yyyymmdd format and other values might be mmddyyyy. This inconsistent format problem can result in the date being incorrectly classified.
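Mixed date formats of the kind described above can be spotted by testing each 8-digit value against both layouts. The Python sketch below is an illustration under stated assumptions: only the two formats named in the text are considered, values must be exactly eight ASCII digits, and the function names are invented for this example.

```python
from collections import Counter
from datetime import date


def is_valid_date(year, month, day):
    """True if the three parts form a real calendar date."""
    try:
        date(year, month, day)
        return True
    except ValueError:
        return False


def matching_formats(value):
    """Return the candidate layouts under which `value` is a valid date."""
    if len(value) != 8 or not (value.isascii() and value.isdigit()):
        return []
    matches = []
    if is_valid_date(int(value[:4]), int(value[4:6]), int(value[6:])):
        matches.append("yyyymmdd")
    if is_valid_date(int(value[4:]), int(value[:2]), int(value[2:4])):
        matches.append("mmddyyyy")
    return matches


values = ["20150131", "01312015", "20151301"]
format_counts = Counter(tuple(matching_formats(v)) for v in values)
```

A value that matches neither layout (the third one here, with month 13) is invalid under both, while a value that matches both is ambiguous and needs a subject matter expert to resolve.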


Nulls, spaces and zeroes in dates

• Presence of nulls and spaces:
  − Impacts inferencing of data classification
  − Might impact correct system behavior - failure to be correctly reported
• Presence of zeroes or defaults:
  − Numeric but is it simply a default?
  − If incorrect, likely to impact usage
  − Look for one or a couple high frequency values (for example, 19000101, 19500101)
    − Value is a valid date, but is really a default
• Document and report anomalies

The data example depicted here has several problems. First of all we see the presence of both nulls and spaces. In addition, some records are in a different date format.
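As a rough illustration of the checks on this slide, the sketch below counts nulls and space-only values separately and lists values that occur often enough to be suspected defaults. It is a hand-written Python approximation, not the product's logic; the function name, the `min_share` cutoff, and the sample values are assumptions.

```python
from collections import Counter


def profile_date_column(values, min_share=0.05):
    """Count nulls and blanks, and list values frequent enough to be suspected defaults."""
    nulls = sum(1 for v in values if v is None)
    spaces = sum(1 for v in values if v is not None and v.strip() == "")
    present = [v.strip() for v in values if v is not None and v.strip() != ""]
    counts = Counter(present)
    suspects = [
        (value, count)
        for value, count in counts.most_common()
        if count / len(present) >= min_share
    ]
    return {"nulls": nulls, "spaces": spaces, "suspected_defaults": suspects}


# Hypothetical column: 19000101 is a valid date but appears far too often.
column = ["19000101"] * 4 + [
    None, "        ", "20150312", "20140620",
    "20131105", "20120229", "20110817", "20100403",
]
report = profile_date_column(column, min_share=0.2)
print(report)
```

A high-frequency value is only a suspect; whether it is really a default has to be confirmed with the business or the source system documentation.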


Skewing of date values

• Presence of skewed quantities:
  − When are skewed values expected?
    − Cyclical occurrences:
      • Billing cycle dates – once per month
      • Salary pay dates
  − Where are skewed values not expected?
    − Dates that represent standard entry points:
      • Entry/creation dates
      • Birth dates in a general population
    − Watch for default entries
• Document and report anomalies

Operational processes frequently cause data skew; that is, spikes in the data values.


Find and document date issues

• Accuracy:
  − What is reasonable?
    − Review distribution
    − What is the oldest date?
    − What is the most recent date?
  − Where do you verify? Need to identify relevant documentation of valid range.
• Consistency:
  − Should have consistent format

If you find date issues you should document them. Are the dates accurate? Are the formats of the dates consistent?
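The accuracy questions above (oldest date, most recent date, documented valid range) can be sketched in a few lines of Python. This is an illustration only; the function name, the sample dates, and the valid range are assumptions, and in practice the range must come from the relevant documentation.

```python
from datetime import date


def date_range_report(dates, valid_from, valid_to):
    """Return the oldest and most recent dates plus any values outside the documented range."""
    if not dates:
        return {"oldest": None, "most_recent": None, "out_of_range": []}
    return {
        "oldest": min(dates),
        "most_recent": max(dates),
        "out_of_range": sorted(d for d in dates if not (valid_from <= d <= valid_to)),
    }


# Hypothetical order dates and a hypothetical documented range.
order_dates = [date(2015, 3, 12), date(1900, 1, 1), date(2014, 6, 20), date(2031, 1, 1)]
range_report = date_range_report(order_dates, date(2000, 1, 1), date(2016, 12, 31))
print(range_report)
```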


Review text fields

• Text data:
  − Usually free-form data (Names, Addresses, and so on)
    − Expectation is that most are unique - review Cardinality
  − Are there frequently occurring values?
    − Any defaults?
  − Focus on data formats:
    − Are there common formats?
    − Special characters such as: (,),/,#,*
      • Special processing conditions (for example, * indicates special code to execute)
    − Statements such as: DO NOT USE
    − Lack of standardization
  − Document and report anomalies

Free-form text fields, although often unique, do not make good identifiers. Ideal identifiers are small and numeric; this saves storage space in databases.
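To make "focus on data formats" concrete, the sketch below reduces each value to a format mask (letters become A or a, digits become 9, everything else is kept) and counts how often each mask occurs. The mask notation imitates the AA999 style used later in this unit, but the code is an assumption-laden approximation in Python, not the tool's implementation; the function names and sample values are made up.

```python
from collections import Counter


def format_mask(value):
    """Map letters to A/a and digits to 9, keeping punctuation and spaces as-is."""
    mask = []
    for ch in value:
        if ch.isdigit():
            mask.append("9")
        elif ch.isalpha():
            mask.append("A" if ch.isupper() else "a")
        else:
            mask.append(ch)
    return "".join(mask)


def format_frequencies(values):
    """Count how many values share each format mask."""
    return Counter(format_mask(v) for v in values)


# Hypothetical text values, including a statement and a special character.
sample_text = ["NY150", "CA221", "DO NOT USE", "tx9", "*SPECIAL"]
mask_counts = format_frequencies(sample_text)
print(mask_counts.most_common())
```

Rare masks, masks containing special characters, and masks of statements such as DO NOT USE are the ones worth reviewing first.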


Additional text field considerations

• Usage of Text fields:
  − Commingled domains?
    − Might indicate a requirement for domain/field conditioning:
      • AddressLine1 - is it street only or does it include city/state or a contact?
      • What data is embedded?
  − Single or multiple entities/subjects?
    − Might indicate an impact for matching/linkage of data
      • Name - is it a single individual name, an organization or legal entity, or does it contain multiple names?
  − Document and report anomalies
  − Recommend additional analysis: QualityStage Text Analysis and Standardization

Data with domain issues - such as address data in a name column - can be analyzed and restructured using QualityStage.


Summary

• Focus on process:
  − Investigation results that are analyzed and validated:
    − Metadata Integrity
    − Domain Integrity
    − Structural Integrity
    − Relational Integrity
    − Cross-Table/cross domain Integrity
  − Key findings identified and linked to business objectives
  − Discovery reiterated, as required, to support business information
  − Data content and validation reports and documents created
  − Recommendations for data reconciliation made
  − Executive summary prepared
  − Final review/delivery provided

Data analysis should be a process that is repeatable. Use the column’s data class as a starting point for different analysis and review paths. Document findings at each decision point. These notes can then be used to produce reports to communicate with the project team.


Checkpoint

• US State codes are an example of what?


Answer the checkpoint questions to test your mastery of the material presented.


Checkpoint solutions

• US State codes are an example of what?
  − Data Class

Answers to the checkpoint questions are provided here.


Demonstration 1: Data classification

• Using IA to find sensitive data
• Using IA to publish data classes
• Using IGC to find a specific table column with multiple classes
• Using IA to mark one of the classes as invalid
• Using IA to publish new classes
• Using IGC to examine changes caused by previous steps


Purpose: Use Information Analyzer to find sensitive data.

Task 1. Analyze Customer_classes metadata.

1. Log into Metadata Asset Manager using student/student.
2. Using the techniques you learned in Unit 4, take the following actions:
   • Open the SAMPLE import area
   • Staged Imports tab
   • Reimport option
   • Next
   • Next
   • Reimport
   • Analyze
   • Preview
   • Share to Repository
   • Exit Metadata Asset Manager

3. Open the Information Server Console as student/student.
4. Open the IAProj project and from the Overview pillar choose Project Properties.
5. Choose the Data Sources tab.
6. Choose the Add button in the bottom right hand corner.

7. From the drop downs in the Select Data Sources listing choose CUSTOMER_CLASSES and click OK.
8. Once successful you will be taken back to the data sources listing. Close this down (saving if prompted) and choose the Dashboard from the Overview pillar.

9. Under the Analysis tab you will now see the Customer_classes data source has been added to the project.
10. Under the Investigate pillar choose Column Analysis.
11. Highlight CUSTOMER_CLASSES and then choose Run Column Analysis in the right hand pane.
12. Click the Submit and Close button at the bottom right hand corner.
13. Once the job has completed, refresh the screen display.
14. Note that Column Analysis finished, but the status is only 95.45%.
15. Expand CUSTOMER_CLASSES and click on the Sequence Column to sort by database column order.
16. Note that there is one column, CREDITCARD_HISTORY, which has a status of Cannot Analyze. It is a column that contains XML, so it can be ignored for this lab.

17. Select Open Column Analysis on the right hand Tasks menu.
18. Review the Summary to see all the information automatically obtained with Column Analysis.
19. Select View Details at the bottom right.
20. Select the EMAIL_ADDRESS column on the left side.
21. Select the Data Class tab. You will see that there are some unrecognized emails that default to a status of 'Invalid'. 'Selected' is automatically the 'best' data class.
22. Select the Email Address Data Class.
23. Note that there are 3512 records with an email address, but only 3511 example values on the right side. This is because the sample values do not show duplicates. Use the Frequency Distribution tab to see the duplicates. Note also that you can Drill Down (bottom right) to the entire record for any data value.
24. Select the COUNTRY_CODE from the Select View list on the left. Note that the Percent for Country Code does not add up to 100%. This is intentional and caused by the fact that many US State Codes look like Country Codes, too (for example: CA = California = Canada).

25. Select the US State Code Data Class to see a sample on the right.
26. Select the Country Code Data Class. Note that DE is not there. This is also by design - each value only shows up in one Data Class. The values in the Data Value column may be in a different order than that shown below, but this is acceptable:
27. Select Country Name. You will see that the values are country codes, not country names.
28. Select Code and you will see there are no values.

29. Sort the values in the Data Class column in descending order (so that US State Code appears at the top), select the Status button, and then choose the Mark as Invalid option.
30. Code has no values and will remain as valid. Country Name has two values which reside in both Country Code and Country Name data classes. It will be marked as Invalid to prevent conflicts.

31. Sort the values in the Data Class column in ascending order (so that Code appears at the top). The results appear as follows (notice that only Country Name should be marked as invalid):
32. Save the changes.
33. Review POSTAL_CODE.
34. Why is the selected class 'Unknown' when over 70% of the values are valid zip codes? Because the default threshold is 75%. A share of just over 70% is less than 75%, so the selected class defaults to 'Unknown'.
35. Change the Selected class to US Zip.
36. Select Investigate > Publish Analysis Results.
37. Select CUSTOMER_CLASSES and then Publish Analysis Results > Current Analysis > OK.
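The threshold behavior seen in step 34 can be pictured with a small Python sketch. This is a simplified model under stated assumptions, not Information Analyzer's actual classification algorithm: the function name and the match shares are hypothetical, and only the 75% default threshold comes from the lab text.

```python
def select_data_class(match_shares, threshold=0.75):
    """Pick the best-matching class, or 'Unknown' if no class reaches the threshold."""
    if not match_shares:
        return "Unknown"
    best_class, best_share = max(match_shares.items(), key=lambda item: item[1])
    return best_class if best_share >= threshold else "Unknown"


# Hypothetical shares of POSTAL_CODE values matching each candidate class.
postal_code_shares = {"US Zip": 0.72, "Code": 0.10}
selected = select_data_class(postal_code_shares)
print(selected)
```

With a share below the threshold the sketch returns 'Unknown', which mirrors why the analyst has to set the Selected class to US Zip manually in step 35.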

Task 2. Review Data Classifications in Information Governance Catalog.

1. Start the IIS Server Launchpad and then log into Information Governance Catalog as student/student.
2. Select Information Assets > Data Classes.
3. Select Email Address.
4. Data Class Details will now be shown for Email Addresses.
5. Select Queries.

6. Select Data Classes and their Classifications.
7. Click the COUNTRY_CODE link.
8. Review the General Information. This is what was identified in Information Analyzer.
9. HOVER (do not click) over Analysis > COUNTRY_CODE. This is analysis summary information, not the detail. The results appear similar to the following:
10. Select Queries again.
11. Select Database Columns and their Data Classifications.
12. Select List Options.
13. Under Save List To File, click the Save as Data Format (CSV) - Default Encoding link to export the list (save it to the default location when prompted).

Task 3. Create a new data class.

1. The Business needs to mask the Salesman_id. These numbers can be used to obtain information about the salesperson. The Business says that these identifiers will always be: Two alpha followed by three numeric (for example, NY150).
2. Using the Information Server Console, log into Information Analyzer as student/student and review your previous column analysis for the Customer_classes data source. Save your previous changes if prompted.
   Id numbers are always AA999. This confirms what the Business said. Go back to the IGC.
3. Select Information Assets > Create Data Class.
4. Fill in the screen as shown:
5. Click Save > Save and Edit Details.

6. Fill out the definition as shown:
7. The Regular Expression is: ^[A-Z]{2}[0-9]{3}$ (a field that ONLY contains two upper case alpha characters followed by three numeric characters). Make sure you Enable the new class and set it to True. Also provide an example as shown below.
8. Save the new Data Class.
9. Go back to Information Analyzer.
10. Re-run Column Analysis on the CUSTOMER_CLASSES table.
11. Review the result. Note that the 'Selected' is 'Code' because it was already published with that class.
12. Change the Selected to Salesperson and Save.
13. Select Investigate > Publish Analysis Results.
14. Select the CUSTOMER_CLASSES table.
15. Select Publish Analysis Results.
16. Click OK.
17. Go back to Information Governance Catalog and review the results by Data Class, using a query, or by Column. See how the new data class has been implemented.

Results: You used Information Analyzer to find sensitive data.
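Before entering a regular expression into a data class definition, it can help to try it against a few values. The sketch below uses Python's standard `re` module with the pattern from Task 3; the helper name `is_salesperson_id` and the test values are assumptions for illustration, and regular expression dialects can differ slightly between Python and the product.

```python
import re

# Pattern from Task 3: two upper case letters followed by three digits.
SALESPERSON_ID = re.compile(r"^[A-Z]{2}[0-9]{3}$")


def is_salesperson_id(value):
    """True only when the whole value is two upper case letters plus three digits."""
    return SALESPERSON_ID.fullmatch(value) is not None


# Hypothetical values: only the first follows the AA999 format.
for candidate in ["NY150", "ny150", "NY1500", "N1500"]:
    print(candidate, is_salesperson_id(candidate))
```

`fullmatch` is used so that a value with a trailing line break is rejected as well.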


Unit summary

• Describe the methodology used to drive data analysis
• Use the methodology to guide your data analysis activities


Unit 8

Table analysis


Information Analyzer v11.5 © Copyright IBM Corporation 2016 Course materials may not be reproduced in whole or in part without the written permission of IBM.


Unit objectives

• Perform table analysis
• Identify both single column and multi-column primary keys


Keys

• Primary:
  − Identifies a record, must be unique, relates columns within a single table
  [Figure: table with columns PK, Descriptive Field A, Descriptive Field B]
• Natural:
  − Identifies a record
  − Allows you to have multiple natural keys that relate to multiple Foreign Key columns
• Foreign:
  − Relates a descriptive column on one table to the primary key column on another table; this establishes a relationship between the two tables

Keys fall roughly into three categories: primary, natural, and foreign. All descriptive attributes relate directly to the primary key. Primary keys uniquely identify a record, natural keys also identify a record and are a normal part of the data, and foreign keys relate a descriptive column on one table to the primary key on another table.


Primary Key determination

• Determine which column, or combinations of columns, drive determination of all other columns’ data

• Understanding of key columns useful for future database model, possibly the target for your project

[Figure: table with columns PK, Field A, Field B]

To find a primary key you must determine which column, or combination of columns, will uniquely define the record. Finding primary keys is frequently necessary before loading records to a target database.


Key analysis overview: Primary Keys

• Supports both Single Column and Multi-Column Primary Key Analysis:
  − Applies Column Analysis results directly to Single Column Key Analysis
    − No additional steps necessary to start reviewing
  − Allows you to control assessment of Multi-Column Key Analysis:
    − Run direct against Full Volume or against a Data sample
    − You can select up to 9 columns to include in Multi-Column Key Analysis
    − You review potential candidates based on data sample results
    − For selected candidates, additional processing generates a Frequency Distribution for all values (in the full file) in the Multi-Column Key Analysis
• User Review:
  − Allows you to drill down into the Frequency Distribution to understand column details.
  − Allows you to review and understand Duplicates in the columns or multicolumn candidates.
  − You can accept a single column or a multi-column as the chosen Primary Key.

Ideally a primary key would be both numeric and small and have a Cardinality equal to the number of records in the input file. Information Analyzer provides fast and efficient means to find primary keys for both single column and multi-column primary key analysis. Once Information Analyzer has discovered possible combinations, the Data Analyst can choose the most desirable one.
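The idea that a key column's Cardinality should equal the number of records can be expressed as a uniqueness ratio (distinct values divided by row count). The Python sketch below is a hedged illustration of that idea, not the product's implementation; the function name, the `flag_threshold` parameter, and the sample rows are assumptions.

```python
def single_column_key_candidates(rows, flag_threshold=1.0):
    """Return {column: uniqueness} for columns whose uniqueness meets the threshold."""
    if not rows:
        return {}
    candidates = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        if any(v is None for v in values):
            continue  # a column with nulls cannot identify every record
        uniqueness = len(set(values)) / len(values)
        if uniqueness >= flag_threshold:
            candidates[column] = uniqueness
    return candidates


# Hypothetical customer rows.
customers = [
    {"cust_id": 1, "state": "NY", "email": "a@example.com"},
    {"cust_id": 2, "state": "NY", "email": "b@example.com"},
    {"cust_id": 3, "state": "CA", "email": None},
]
key_candidates = single_column_key_candidates(customers)
print(key_candidates)
```

Lowering `flag_threshold` slightly surfaces near-unique columns whose duplicates may be data anomalies rather than evidence against the key.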


Primary Key: Walkthrough

• Invoke Primary Key Analysis:
  − From the Investigate Pillar menu, choose the Key and cross domain Analysis task
  − From the Project Dashboard Getting Started Panel, select the Key and cross domain Analysis button

Primary key analysis can be initiated in several ways. The dashboard provides one approach and the Investigate pillar provides another. This slide shows both paths.


Primary Key analysis: Single column key details

• Single column primary key analysis:
  − Displays Table Summary, Column Details, flags candidates
  − Options to drill into Frequency Distribution or check identified Duplicates
  − Can Accept or Remove Primary Key status indicator

Most single column primary key candidates will have a data class of identifier. The green flag shown on the graphic above points to a candidate primary key column.


Single column key duplicates

• Reviewing Duplicates:
  − View summary of distinct and duplicated values
  − Display list of all Primary Key values and #/% duplicated

If duplicates exist in a strong primary key candidate then these duplicates should be thoroughly explored and documented. Information Analyzer provides the Data Analyst with screens where they can review existing duplicates and resolve data anomalies.


Multi column key analysis

[Figure: flow diagram of the two analysis paths, With Data Sample (Multi-Column Primary Key Analysis, Evaluated Results, Duplicate Check) and With Full Volume]

Multi-Column Primary Key
• Allows selection of specific columns to assess
• Supports up to 9 columns combined
• Evaluates all group combinations up to total # of columns selected:
  – With 3 columns (A, B, C) selected, tests the column combinations AB, AC, BC, and ABC.
  – Number of tests required grows significantly as more columns are included.

Evaluation Options
• Full Volume:
  – Creates Virtual Columns
  – Gives complete assessment immediately
  – More processing so choose if likely combination is known
• Data Sample:
  – Sample must be created first
  – Tests permutations against sample
  – Choose if data is largely unknown

Results
• Assessment of % Unique, % Null, and % Duplicate based on evaluation options
• Can run a Duplicate Check to validate Full Volume and identify duplicated keys

If Multi-Column primary key analysis is required to identify a unique combination of columns, two paths exist to find the correct combination. The upper path first creates a data sample. Next, primary key analysis is run against the data sample, and from those results a candidate primary key combination is established. Using the candidate combination, run a duplicate check; this automatically runs against the full data source, not the data sample. From those results you should be able to decide whether or not you want to keep the primary key candidate. The lower path does not create a data sample but instead runs against the full data volume. Both methods will eventually yield the same result, but the first method is computationally more efficient; that is, less work is required.
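The growth in the number of tests, and the evaluation of each column group, can be sketched with Python's standard `itertools` and `math` modules. This is an illustrative approximation, not the tool's algorithm; the function names and the Order Detail style sample rows are assumptions.

```python
from itertools import combinations
from math import comb


def count_combination_tests(n_columns, composite_max):
    """Number of column groups of size 2..composite_max that would be tested."""
    return sum(comb(n_columns, k) for k in range(2, composite_max + 1))


def multi_column_uniqueness(rows, columns, composite_max):
    """Return {column_group: share of distinct value tuples} for each group tested."""
    if not rows:
        return {}
    results = {}
    for size in range(2, composite_max + 1):
        for group in combinations(columns, size):
            keys = [tuple(row[c] for c in group) for row in rows]
            results[group] = len(set(keys)) / len(keys)
    return results


# Hypothetical sample rows.
sample_rows = [
    {"OrderID": 1, "ItemNo": 1, "Qty": 2},
    {"OrderID": 1, "ItemNo": 2, "Qty": 2},
    {"OrderID": 2, "ItemNo": 1, "Qty": 5},
    {"OrderID": 2, "ItemNo": 1, "Qty": 7},
]
combo_results = multi_column_uniqueness(sample_rows, ["OrderID", "ItemNo", "Qty"], 3)
print(count_combination_tests(3, 3), count_combination_tests(9, 9))
print(combo_results)
```

Three columns give 4 groups (AB, AC, BC, ABC), while nine columns with a composite maximum of nine give 502, which is why starting from a data sample and a small composite maximum is the cheaper path. A group that looks unique in a small sample still needs a duplicate check against the full volume.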


Multi Column Primary Key

• Multi-Column Primary Key Analysis:
  − Can be done against whole table or a Data Sample
  − Checks potential of n-columns to be Primary Keys
  − Checks selected Multi-Column Candidates against full volume for Duplicates

The Multi-Column tab on the primary key analysis screen allows you to choose either method one or method two (depicted on the previous graphic); that is, first create a sample or run against the full data volume.


Data sampling

• Create Data Sample:
  − Opens the job schedule
  − Enter Job Name
  − Confirm Sample Size
  − Choose Sample Method
  − Select Run Now
  − Click Submit button

Information Analyzer provides tools for the Data Analyst to extract sample data from the full dataset. This slide documents the tasks required to create a data sample.


Sampling methods

• Sample Methods:
  − Random: Selects a random set of rows.
  − Sequential: Selects the first x rows of data from the table where x = sample size.
  − Nth: Selects every Nth record where N is an interval defined.

Testing a Random Sample: Test a random sample of data records from the data environment. The relative size of the sample (n) compared to the full size of the data environment is not the major factor in the reliability of the statistical sampling methodology. This methodology will work with any size sample. For evaluating data, a sample size from 1,000 to 10,000 records is adequate for a data environment of any size.

To create a data sample you choose parameters that direct how that sample will be formed. Choose one of the sampling methods here. You can then test using your data sample.
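The three sampling methods can be mimicked in Python to see how they differ. The sketch below is only an illustration: the function names are made up, and whether the Nth method starts at the first or the Nth record is an assumption (here it starts at the Nth).

```python
import random


def sequential_sample(rows, size):
    """Sequential: the first `size` rows."""
    return rows[:size]


def nth_sample(rows, interval):
    """Nth: every `interval`-th row, starting with row number `interval` (assumed)."""
    if interval < 1:
        raise ValueError("interval must be at least 1")
    return rows[interval - 1::interval]


def random_sample(rows, size, seed=None):
    """Random: `size` rows drawn without replacement."""
    rng = random.Random(seed)
    return rng.sample(rows, min(size, len(rows)))


# Hypothetical table of ten row numbers.
table_rows = list(range(1, 11))
print(sequential_sample(table_rows, 3))
print(nth_sample(table_rows, 3))
print(random_sample(table_rows, 4, seed=7))
```

A sequential sample is cheap but reflects only the start of the table, so sorted or date-ordered sources can make it unrepresentative.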


Data sample properties

• Sample Created:
  − View sample status
  − Can view sample details.

The fact that a data sample exists is recorded with its creation date, as shown on this graphic.


Run analysis

• Multi Column Primary Key Analysis:
  − Enter Composite Maximum value – this is the maximum number of columns to include in a combination test.
  − Select Columns to include.

Select all columns that will be used to form the composite key; also note the composite max parameter. Note that we avoided including columns with null values present.


View results of multi-column key analysis

• Multi Column Primary Key Analysis:
  − Open the Primary Key workspace and select the Multi-Column tab.
  − Candidate combinations with Uniqueness above target threshold are Flagged.
  − For strong candidates, can review Frequency Distribution or run a Duplicate Check against full volume.

Candidate combinations that meet or exceed the flag percentage minimum will be green-flagged. The flag percentage minimum can be adjusted and the apply button clicked to rebuild this screen. In this example no 2-column combinations meet the primary key threshold.


Duplicate check result

• Multi Column Primary Key Analysis:
  − Duplicate Check validates Primary Key candidate for full volume AND generates a Virtual Column
  − Select Run Duplicate Check button, then view result

The column sequence can be changed - use the arrow keys to adjust the sequence. The duplicate check runs against the full volume of data – therefore, it rereads the full source data and provides the most accurate results.


Duplicate check

• Important note:
  − Duplicate Check runs against full volume of data.
• When most appropriate to run:
  − Column Analysis on a Single-column Primary Key or Multi-column Primary Key Analysis was executed on a data sample, and 100% uniqueness returned.
    − Running Duplicate Check ensures the key is truly unique against full volume data.
  − Any key analysis was executed against full volume data, and < 100% uniqueness returned.

Very important - the duplicate check reads the full volume data source. Statistics from both the original run against the data sample and the run from the duplicate check will be stored and displayed.
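Conceptually, a duplicate check counts every key value over the full row set and reports the ones that occur more than once. The Python sketch below illustrates that idea under stated assumptions; it is not the product's job logic, and the function name, the returned fields, and the sample rows are hypothetical.

```python
from collections import Counter


def duplicate_check(rows, key_columns):
    """Count each key value over all rows and report duplicated keys and the unique share."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    duplicates = {key: n for key, n in counts.items() if n > 1}
    unique_rows = sum(1 for n in counts.values() if n == 1)
    unique_share = unique_rows / len(rows) if rows else 0.0
    return {"duplicates": duplicates, "unique_share": unique_share}


# Hypothetical full-volume rows: the key (2, 1) occurs twice.
full_rows = [
    {"OrderID": 1, "ItemNo": 1},
    {"OrderID": 1, "ItemNo": 2},
    {"OrderID": 2, "ItemNo": 1},
    {"OrderID": 2, "ItemNo": 1},
    {"OrderID": 3, "ItemNo": 1},
]
check = duplicate_check(full_rows, ["OrderID", "ItemNo"])
print(check)
```

The duplicated key values returned here are the ones to drill into and document for the project team.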


Basic data profiling techniques in practice


Determine structural integrity

• Review Single Column uniqueness first
  − Look at Identifier fields
• Review Multi-Column uniqueness as needed:
  − Unless exact key is known, start with a data sample
  − Evaluate 2 or 3 column combinations first
  − Typically focus on the initially sequenced columns – most likely the ones used as keys
  − Do not examine too much at once
  − Expand view only as needed to more column combinations or with additional column selections

Important performance note - examine a small number of combinations to begin with and expand the number of columns involved only as necessary. Multi-Column key analysis is costly.


Structural integrity: Is the structure usable?

• Example Findings: There is no single unique key for Order Detail. The combination of OrderID and ItemNo does produce a nearly unique key, but there are 2 values duplicated.


Annotate any anomalies as documentation for the ETL developers.
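
When documenting such an anomaly, it helps to list the exact key values that are duplicated. The sketch below shows one way to do that outside the tool, in plain Python. The function duplicated_keys and the sample rows are hypothetical; the real Chemco data is not reproduced here.

```python
from collections import Counter

def duplicated_keys(rows, key_columns):
    """Return {key_tuple: row_count} for key values that occur more than once."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}

# Hypothetical Order Detail rows with two duplicated OrderID + ItemNo values.
order_detail = [
    {"OrderID": 10, "ItemNo": 1, "Qty": 5},
    {"OrderID": 10, "ItemNo": 2, "Qty": 1},
    {"OrderID": 11, "ItemNo": 1, "Qty": 3},
    {"OrderID": 11, "ItemNo": 1, "Qty": 3},
    {"OrderID": 12, "ItemNo": 4, "Qty": 7},
    {"OrderID": 12, "ItemNo": 4, "Qty": 2},
]

dupes = duplicated_keys(order_detail, ["OrderID", "ItemNo"])
for key, n in sorted(dupes.items()):
    print(f"OrderID={key[0]} ItemNo={key[1]} appears {n} times")
```

The resulting list of key values can be pasted into the annotation so that ETL developers know which records need attention.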


Checkpoint

• True or False? Primary Key analysis requires 100% uniqueness of data values in the designated key column.
• True or False? Primary Keys can be either one or two columns.
• True or False? Primary Keys must be determined within a project context.


Answer the checkpoint questions to test your mastery of the material presented.


Checkpoint solutions

• True or False? Primary Key analysis requires 100% uniqueness of data values in the designated key column.
 − False
• True or False? Primary Keys can be either one or two columns.
 − False
• True or False? Primary Keys must be determined within a project context.
 − True


Answers to the checkpoint questions are provided.


Demonstration 1 Primary key analysis

• Run single column and multi-column Primary key analysis


Demonstration 1: Primary key analysis

Purpose: Determine the Primary Key for each Chemco table. Determining Primary Keys is a fundamental activity performed prior to finalizing a target database data model. IA has numerous tools to discover these keys and can perform the analysis in a very efficient manner. This demonstration teaches how to discover both single-column and multi-column Primary Keys. A methodology is employed for multi-column key discovery.

Task 1. Single column primary key analysis.

1. Launch Information Server and then open the Chemco project.
2. Open Key and Cross-Domain Analysis from the INVESTIGATE pillar menu.
3. Expand the lines and then select the VENDOR table.
4. Click the Open Key Analysis option in the Tasks pane.
5. View the Key analysis summary and then verify that the Single Column tab is selected, which is the default.


Each candidate column has a green flag; your results may be different depending on choices you made in column analysis.

The green flag marks columns having a unique % value greater than the value in the Flag Percentage Minimum box. Note the Null % column - the VENDORCODE has a large number of NULL-valued records.

6. One at a time, select each column and then click Primary Key Analysis - View Frequency Distribution. Note the Data Class values for VENDNO and VENDORNAME. Since both columns have 100% uniqueness, why is there a difference? The answer is the column lengths.
7. For each row, click the Primary Key Analysis drop-down and then click View Duplicate Check. Since both the VENDNO and VENDORNAME columns report 100% uniqueness, you would not expect any duplicates.
8. Apply this function on the VENDORCODE field.
9. Return to the Key Analysis main screen. You have two candidate columns for Primary Key. VENDNO is the smaller of the two candidates.
10. Click the VENDNO column and then click Key Status and then select Mark as Primary Key. VENDNO is the best candidate because it has 100% uniqueness and a small length.


11. Click the Close button.
12. Select the ORD_HDR table and click the Open Key Analysis task. Note this table has only one candidate (ORDERID column).
13. Select this column as the Primary Key and return to the table picker screen.
14. Select the CUSTOMER table and select Open Key Analysis on the Single Column tab. Only one candidate column has a green flag – the CUSTID column. However, the uniqueness percent is not quite 100%. Recall the CUSTOMER table problem that you identified in an earlier demonstration; you found duplicates in the CUSTID field.
15. Select the CUSTID column, click Primary Key Analysis, and then select View Duplicate Check. This problem was identified and documented in an earlier demonstration, but you could now write another note and attach it at the table level.
16. From the Key Analysis screen, accept the CUSTID field as the Primary Key.
17. Click Close and then click Close again to return to the Key and Cross-Domain Analysis screen.

Task 2. Multi-column PK analysis.

You will use the data sampling strategy for this portion of the demonstration. You will first create a data sample, and then submit a job to test combinations of columns. You will next identify candidate combinations and then run a duplicate check using each candidate. Recall that the duplicate check will run against the full data source; therefore, % uniqueness results from the data sample run will likely differ from the duplicate check’s % uniqueness.

1. Click the ORD_DTL table to select it.
2. Click the Open Key Analysis task. Note that no single column has been identified as a Primary Key candidate.
3. Review the Uniqueness % for each column. There are not any clear-cut Primary Key candidates.
4. Click the Multi-Column tab.
5. Click the Analyze Multiple Columns button at the bottom right.


6. In the Select columns to include pane, choose the following columns for analysis and ensure the Composite Key Maximum is set to 2.
   • STOCKCODE
   • COMPLETE
   • ITEMNO
   • ORDERID
7. Click the Submit button. Once the job has completed, the window returns to the multi-column pane. Two column combinations are now flagged as candidates.
8. If you do not see the flag settings, then set the Flag Percentage Above value to 99.99 and then click the Apply button.


9. View the Uniqueness % in the Analysis Statistics column of the report.
10. Evaluate each key column combination.
11. Select the first column combination and then click Primary Key Analysis and then click Run Duplicate Check.
12. Take all defaults and click the Submit button to run the duplicates check job for this column combination.
13. After the job finishes, return to the Multi-column screen, select the first column combination and click Primary Key Analysis and then select View Duplicate Check. The Duplicate Check columns should now have a green bar.
14. Run a duplicate check for the second column combination and then view the results.
15. Select the ORDERID ITEMNO column combination for the Primary Key, click Key Status and then click the Mark As Primary Key button in the lower-right portion of the screen. The ORDERID ITEMNO combination was selected because it resulted in the highest uniqueness %.
16. Document the data duplication anomaly using the notes function. Click the Close button.


Task 3. Perform primary key analysis for the remaining tables.

1. Using the techniques you learned in this demonstration, perform a Primary Key analysis on all of the other tables in the Chemcoseq data source. If you are given a choice between two columns, each of which has 100% uniqueness, use the data class and length as a determining factor - choose the one with the smallest length. Use Multi-Column key analysis when necessary and use the data sample technique for tables having more than 500 records. If necessary, change the value in the Composite Key Maximum box to a number higher than 2.

Note: Ensure that you have run column analysis for all the tables in the Chemcoseq data source, as per the instructions in Unit 6.

Table           Primary Key         Notes
CUSTOMER        CUSTID              Duplicates exist
CREDIT_RATING
ORD_HDR         ORDERID
ORD_DTL         ORDERID + ITEMNO    Duplicates exist
ITM_MSTR
ITM_SPLR
VENDOR          VENDNO
MSTRCTLG
UNITCTLG
CARRIER
PARTTBL

2. After completion, close Information Server Console.

Results: You determined the Primary Key for each Chemco table. This demonstration showed how to discover both single-column and multicolumn Primary Keys.


Unit summary

• Perform table analysis
• Identify both single column and multi-column primary keys


Unit 9

Cross Table Analysis


Information Analyzer v11.5 © Copyright IBM Corporation 2016 Course materials may not be reproduced in whole or in part without the written permission of IBM.


Unit objectives

• Perform foreign key analysis
• Perform cross-domain analysis


What is cross table analysis?

• Compares distinct values from a column in one table against distinct values from columns in other tables
 − Note: If one of the columns being compared is the Primary Key on its table, then you could be discovering a Foreign Key
• Goal is to detect columns that share a common domain of values
• Importance:
 − Identifies potential Foreign Key relationships
 − Identifies redundant data between tables
 − Identifies potential referential integrity issues
 − Might uncover unknown business issues


The distribution file, generated from Column Analysis, contains distinct values per column and is used as the basis for the Cross Table Analysis function. The Cross Table Analysis function compares values from one table to find other columns containing the same values. You can even compare columns across multiple databases. This is a bi-directional comparison process, where a “base” column is compared to a “paired” column. You can configure to what degree (based on a percentage) you consider a match and that decision determines what is displayed for review. Determining what all these “Matches” mean is the role of the reviewer. Based on the meaning of the data they must determine what is a potential foreign key, redundant data, coincidence, or a potential referential integrity issue.
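
The bi-directional comparison described above can be sketched with set operations on the distinct values of the two columns. This is an illustration in plain Python, not Information Analyzer's implementation. The function names, the rule that a pair is reported when either direction reaches the threshold, and the sample values are all assumptions.

```python
def domain_overlap(base_values, paired_values):
    """Compare the distinct values of two columns in both directions."""
    base, paired = set(base_values), set(paired_values)
    common = base & paired
    return {
        "common": common,
        "base_only": base - paired,
        "paired_only": paired - base,
        "base_in_paired_pct": 100.0 * len(common) / len(base) if base else 0.0,
        "paired_in_base_pct": 100.0 * len(common) / len(paired) if paired else 0.0,
    }

def is_match(result, threshold_pct):
    """Report a pair when either direction reaches the threshold (assumed rule)."""
    best = max(result["base_in_paired_pct"], result["paired_in_base_pct"])
    return best >= threshold_pct

# Hypothetical values for a base column and a paired column.
base_column = ["V1", "V2", "V3", "V4", "V5"]
paired_column = ["V1", "V2", "V2", "V3", "V4"]

result = domain_overlap(base_column, paired_column)
print(result["base_only"], result["paired_in_base_pct"], is_match(result, 98.0))
```

The three sets (common, base only, paired only) are what a reviewer inspects to decide whether a match is a foreign key, redundant data, coincidence, or an integrity problem.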


Foreign Key analysis

• Occurs in the context of a project:
 − Same table can be analyzed multiple times in different projects
 − Does not copy the original metadata down into the project
• Results in structural and relationship validation:
 − Identifies logical FKs for non-relational data environments
 − Identifies referential integrity violations for defined or selected FKs
• User Review allows Data Analyst to:
 − Drill down into the results of Key analysis and select a FK
 − Review FK referential integrity violations
 − Review common domain columns


Foreign keys are typically descriptive attributes in their own table; that is, they are not primary keys there. These foreign keys provide an implied linkage between two records which are in separate tables. However, a primary key to foreign key relationship implies the presence of a primary key on one of the tables.


Relational integrity

• Measures actual data values to assess:
 − Physical relational integrity
   − Foreign Key validity
   − Redundant storage of data
 − Logical relational integrity (not defined formally)

[Figure: Foreign Key example. Table ORD_HDR with columns labeled PK, DFA, FK; table CUSTOMER with columns labeled PK, DFA, DFA. How do these tables relate to one another?]


Relational integrity refers to the column relationships between tables.


FK analysis: Initial steps

• Initiate FK Analysis Job:
 − Select base table
 − Select task - Run Key and cross domain analysis


To begin the foreign key analysis process, choose two tables. You cannot drill down below the table level; that is, you cannot choose specific columns. If you choose two tables and the Run Foreign Key Analysis task remains grayed out, then neither table you selected contains a primary key.


FK analysis: Select pair table

• Select pair table
• Click Add to Pair List
• Submit job


FK analysis: Review results

• Shows domain overlap
• Exceptions can be viewed
• Can accept relationship as foreign key or redundant


The screen capture shows the graphical view of data value overlap. The two circles form three areas: one where they intersect and two where they do not. This graphic is supported by the grid on the left in the Analysis Details.


FK analysis: Review domain overlap exceptions

• View Frequency Values
 − Shows a distinct value comparison between the two columns


You can use the frequency values to find those values that are not in the common area between the paired column and the base column. You can sort by double-clicking the "Not Comm" column. This action causes the red-flagged records to appear at the top.


Relational integrity: Can related data be linked?


Note that the relationship between these two tables is expressed as a measurement called Integrity %. Violations of this linkage are shown at the bottom of the screen.
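
The idea behind an integrity percentage and its list of violations can be sketched as follows. The exact formula that Information Analyzer uses is not stated in this guide, so the definition below (share of non-null foreign key rows whose value exists in the primary key column) is an assumption, as are the function name, the CUSTID foreign key column on ORD_HDR, and the data.

```python
def referential_integrity(fk_values, pk_values):
    """Return (integrity_pct, violations) for a foreign key column.

    Assumed definition: integrity % is the share of non-null foreign key rows
    whose value exists in the primary key column.
    """
    pk = set(pk_values)
    checked = [v for v in fk_values if v is not None]
    violations = sorted({v for v in checked if v not in pk})
    if not checked:
        return 100.0, violations
    valid_rows = sum(1 for v in checked if v in pk)
    return 100.0 * valid_rows / len(checked), violations

# Hypothetical data: ORD_HDR rows that reference CUSTOMER.CUSTID.
custid = ["C1", "C2", "C3"]
ord_hdr_custid = ["C1", "C1", "C2", "C9", None]

pct, orphans = referential_integrity(ord_hdr_custid, custid)
print(f"Integrity {pct:.1f}%, violations: {orphans}")
```

Values such as C9 in this example are the kind of orphaned references that would be listed as violations of the linkage.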


Demonstration 1 Foreign key analysis

• Discover Foreign key relationship


Demonstration 1: Foreign key analysis

Purpose: Discover foreign key relationships. Foreign keys relate a column on one table to a column defined as a primary key on a second table. Foreign keys (a column pair) establish relationships between tables. One of the columns is a primary key and the other column, on a different table, is a descriptive column. The values of the two columns should have a high degree of overlap.

Task 1. Foreign Key Analysis.

In Foreign Key analysis, you look for a data redundancy relationship between two tables, one of which must have a Primary Key. Cross-domain analysis is much the same, but does not require that one of the tables have a Primary Key; in fact, cross-domain analysis can be run between columns in the same table.

1. Launch Information Server Console and log in as student/student.
2. Open the Chemco project.
3. From the pillar menu, click Investigate and then select Key and Cross-Domain Analysis.
4. Click the VENDOR and ITM_MSTR tables and then click the Run Key and Cross-Domain Analysis option in the Tasks pane. A screen will appear listing the tables; you can now designate which one is the base and which one is the pair.
5. Remove ITM_MSTR from Base Tables and VENDOR from Paired Tables.
6. Click Next in the lower-right portion of the screen. Column pairings appear in a grid format.
7. Click the Remove button under the Selected Pairs grid and then select Remove All. Selected Pairs will disappear.


8. In the Available Pairs grid, select the combination VENDNO and SUPPLIER.
9. Click the Add to Selected button and click the Add option.
10. Click the Submit button and then select Submit in the lower-right corner to run a job.
11. After the job successfully completes, click the Close button which returns to the Key and Cross-Domain Analysis main window.
12. Select the VENDOR table and then click Open Cross-Domain Analysis from the Tasks pane.
13. Click the Apply button. A green flag will appear.
14. Information Analyzer flagged the SUPPLIER column on the ITM_MSTR table as a candidate for a Foreign Key.
15. Click the ITM_MSTR - SUPPLIER column to select it and then click the View Details button in the lower right portion of the screen.
16. Click the Analysis Details tab to view the overlap. The graphic is shown on the right portion of the screen. Note that one value on the Base table (VENDOR) is not contained in the paired table (ITM_MSTR). Does this make business sense? This question can be answered by a subject matter expert. Also note that all values on the ITM_MSTR table are contained in the VENDOR table.
17. Click the Paired Column Status button on the lower-right and click the Mark as Foreign Key option.


18. To find the value that is present on the VENDOR table that is not on the ITM_MSTR table, return to the Cross-Domain screen and click the View Details button.
19. With the Frequency Values tab selected, scroll down to find the values with a red flag. This is the value that is present on the VENDOR table but missing on the ITM_MSTR table. If you see more values with a Count = 0, recall that you added records to the VENDOR table that are not present in the real data. Therefore, these values did not count in the summary statistics. These rows may be in a different order, but that is acceptable.

Foreign Key relationships are a special case of cross-domain analysis; that is, one of the columns in the pair must be a Primary Key to its table.

20. In a similar fashion, search for Foreign Key relationships on the following tables.
   • ORD_HDR -- ORD_DTL
   • CUSTOMER -- ORD_HDR
21. Click Close and then close the Chemco project but leave Information Server Console open for the next demonstration.

Results: You discovered foreign key relationships.


Cross domain analysis

• Same as Foreign Key analysis, but neither column is a Primary Key
• Prerequisite
 − In the context of a project
• Invoking Cross Domain:
 − From the Investigate Pillar menu, choose the Key and cross domain Analysis Task
 − Select table
 − Select base column
 − Add column to compare – called the pair column
 − Run analysis job
 − Review results
 − Mark as redundant


Cross domain analysis requires a project context and can be initiated from the Investigate pillar menu. It measures the degree of value overlap between two columns.


View analysis details for cross domain

• Shows a detailed comparison of the paired column to the base column
• Can view comparison of domain values (frequency distributions)
• Can mark the paired column as redundant with the base column


This graphic shows a complete overlap in values for the two columns. Note that two views are present - analysis details and frequency values.


View frequency values

• Reviewing Cross Domain Analysis, View Analysis Details
 − Shows a distinct value comparison of the base column frequency distribution and the paired column frequency distribution


Note that the Frequency Values view shows not only the column pairings but also the record counts.


Cross-Table integrity review

• Cross-Field Analysis:
 − Data is reviewed on a relational basis to assess the integrity of specific data interactions including:
   − Conformity of fields to common data types, patterns, and formats
   − Conformity of field combinations to expected data values within a single file
 − Weed out false overlaps:
   − Indicators will overlap but only focus on those with same name
   − Codes will likely overlap – assess name and understanding of data
   − Dates and Quantities – focus on instances where data was expected to move from table-to-table


Cross-field analysis can be used within a table or between two tables.


Cross-Table data redundancy

• Redundant data:
 − Considered as the same data stored in multiple places
   − Look for high overlap (98%+) and similar frequency
 − Usually seen when data fields are replicated in multiple tables in the same system
   − Separate tables for similar data (e.g. the PARTTBL and ITM_MSTR tables – both carry PARTSPEC information)
 − Might represent same data carried along in a process from system to system, stored over and over
 − Opportunity to reduce data storage and processing costs
• Mark columns as Redundant
• Document and Report condition


Data redundancy is frequently the result of database performance efforts.


Cross-Table data references

• Reference data:
 − Same data (usually Codes) stored in multiple places might represent reference data versus specific instances
   − Look for high overlap, but reference data will have occurrence of 1 for each value
 − Look for instance values not in reference data:
   − Likely indicates a domain not validated – review Domain Integrity for the noted field
   − Might indicate reference data not maintained
• Document and Report conditions


Reference data, such as found in look-up tables, is an example of planned data redundancy.
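
The two checks described on this slide, that reference data has an occurrence of 1 for each value and that instance values should all appear in the reference data, can be sketched in plain Python. The function names and the unit-of-measure codes are hypothetical and are not taken from the Chemco data.

```python
from collections import Counter

def looks_like_reference_data(values):
    """True when every value occurs exactly once, as in a look-up table."""
    return all(n == 1 for n in Counter(values).values())

def values_missing_from_reference(instance_values, reference_values):
    """Instance values that have no entry in the reference column."""
    return sorted(set(instance_values) - set(reference_values))

# Hypothetical codes: a look-up column and a column that uses those codes.
reference_codes = ["EA", "KG", "LT"]
instance_codes = ["EA", "EA", "KG", "LT", "KG", "BX"]

reference_ok = looks_like_reference_data(reference_codes)
instance_ok = looks_like_reference_data(instance_codes)
missing = values_missing_from_reference(instance_codes, reference_codes)
print(reference_ok, instance_ok, missing)
```

A non-empty list of missing values points to either an unvalidated domain in the instance column or reference data that has not been maintained.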


Checkpoint

1. True or False? Foreign Key Analysis requires that the base column is already identified as a Primary Key.
2. True or False? Cross domain analysis does not require that one of the columns is a Foreign Key.
3. True or False? Information Analyzer can build a list of compatible columns for cross domain analysis.


Answer the checkpoint questions to test your mastery of the material presented.


Checkpoint solutions

1. True or False? Foreign Key Analysis requires that the base column is already identified as a Primary Key.
   True
2. True or False? Cross domain analysis does not require that one of the columns is a Foreign Key.
   True
3. True or False? Information Analyzer can build a list of compatible columns for cross domain analysis.
   True


Checkpoint questions and their solutions are provided on the slide.


Demonstration 2 Cross domain analysis

• Run cross domain analysis to discover data redundancy


Demonstration 2: Cross domain analysis

Purpose: Discover data redundancy between ITM_MSTR and PARTTBL tables. Sometimes data redundancy is planned and sometimes it is accidental. Most third-normal form data models try to eliminate data redundancy as much as possible. In any case, it is important for the Data Analyst to understand where and why the redundancy occurs.

Task 1. Conduct Common Domain Analysis.

Cross-domain analysis differs from Foreign Key analysis in the prerequisites. Both analysis types search for two columns containing redundant data, within the threshold parameters of your project. Foreign Key analysis requires that one of the columns is a Primary Key; cross-domain analysis does not have that requirement. Notice that a column named VENDSPEC appears on both the PARTTBL and ITM_MSTR tables. Is this a redundant column?

1. Open the Chemco project.
2. Click Key and Cross-Domain Analysis from the INVESTIGATE pillar menu.
3. Select the ITM_MSTR and PARTTBL tables and then select Run Key and Cross-Domain Analysis from the Tasks pane.
4. Remove the PARTTBL table from the Base Tables grid and the ITM_MSTR table from the Pair Tables grid.
5. Click Next.
6. Remove all pairings and then add the VENDSPEC pair back.
7. Click Submit and then select Submit.


8. After the job finishes, click the Close button which returns you to the Key and Cross-Domain Analysis tab, select the ITM_MSTR table, and then select the Open Cross-Domain Analysis option from the Tasks list.
9. Click the VENDSPEC column and then click the Apply button.
10. Click the VENDSPEC column, and then click the View Details button in the lower-right portion of the screen.
11. Click the Analysis Details tab. Note the domains overlap, but not completely.
12. Click the Paired Column Status button and then Mark as Redundant.
13. Click Close, click Close again and then close Information Server Console.

Results: You discovered data redundancy between the ITM_MSTR and PARTTBL tables.


Unit summary

• Perform foreign key analysis
• Perform cross-domain analysis


Unit 10

Baseline analysis


Information Analyzer v11.5 © Copyright IBM Corporation 2016 Course materials may not be reproduced in whole or in part without the written permission of IBM.


Unit objectives

• Perform Baseline Analysis


Baseline analysis: Understanding the business problem

• Baseline Analysis checks the:
   – Structure
   – Relationships
   – Integrity of data environments
• Between two points in time
• By identifying specific changes in defined and inferred structure and in data content.


The baseline analysis creates a checkpoint of analysis results. This is stored separately from the current analysis. Therefore, it is possible to create analysis results, then baseline those results, and then repeat the data analysis against the same data. This will create a current analysis that differs from the baseline.
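The idea that a baseline is stored separately from the current analysis can be sketched in a few lines of Python. This is only an illustration: the dictionary layout, the `set_baseline` helper, and the `VENDNAME` column are assumptions made for the example and do not reflect how Information Analyzer stores its results.

```python
import copy

# Hypothetical, simplified analysis results for one table (not IA's real format).
current_analysis = {
    "VENDORCODE": {"data_class": "Code", "invalid_count": 0},
    "VENDNAME": {"data_class": "Text", "invalid_count": 0},
}


def set_baseline(analysis):
    """Return an independent snapshot so later edits cannot alter it."""
    return copy.deepcopy(analysis)


baseline = set_baseline(current_analysis)

# The analyst later revises the current analysis; the baseline stays frozen.
current_analysis["VENDORCODE"]["invalid_count"] = 12

# Columns whose current results no longer match the baseline snapshot.
changed = [col for col in baseline if baseline[col] != current_analysis[col]]
print(changed)  # ['VENDORCODE']
```

The deep copy is the point of the sketch: because the snapshot shares nothing with the live results, repeating or revising the analysis changes only the current side of the comparison.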


Overview

• Allows the same table to be evaluated at multiple points in time.
• Comparison is always between a set Baseline version and either a set Checkpoint version or the current state of analysis.
• Baseline Analysis results in structure and content validation:
   – Provides a summary view for the table
   – Identifies differences in structure and content at the column level
• User Review:
   – Allows the user to review the summary variations for the table
   – Allows the user to review the detail variations for each of the columns
   – Users can go to Column Analysis to view current state of analysis


A table baseline provides a comparison point for later analyses.


Starting baseline analysis

• Prerequisites: In the context of a project, column analysis is complete.
• Invoking Baseline Analysis: From the Investigate Pillar menu, choose the Baseline Analysis task.

Baseline analysis is initiated from the Investigate pillar menu. Only tables can be selected; individual columns cannot be chosen.


Setting the baseline (1 of 2)

• Setting a Baseline:
   – Set Baseline function will process immediately and indicate when baseline is complete for selected tables
   – Click Close to return to Picker


You may set either a baseline or a checkpoint.


Setting the baseline (2 of 2)

• Setting a Baseline:
   – Once complete, Baseline Date is displayed for table
• Note: Similar process can be run to Set Checkpoint


Setting a checkpoint is a very similar process to setting a baseline. Subsequently, a baseline can be compared to either the checkpoint or to a current analysis.


View the baseline analysis (1 of 2)

• Viewing Baseline Analysis:
   – From the Picker, select a table (or tables) and choose View Baseline Analysis
• Change values in column analysis


The data analyst returned to column analysis and changed the following for the field CHECKDT:
• Changed the data class from Text to Date.
• Changed the null to invalid and spaces to default.


View the baseline analysis (2 of 2)

• Viewing Baseline Analysis:
   – Choose the comparison point: an established Checkpoint or the current analysis
   – Click OK


Baselines can be viewed and that view provides the basis for comparisons to the current analysis.


View the baseline analysis summary

• Baseline Summary:
   – Summarized results for the table level
   – Results include the number of columns where differences occur and flags indicating potential disparities


Information Analyzer places red flags as eye catchers. These are points where the current analysis differs from the baseline analysis.


View the baseline analysis differences

• Baseline Differences:
   – Detailed results for the column level
   – Results include the column level summaries of distinctions for both Structure (Defined and Inferred) and Content


By examining the baseline differences view, the data analyst can determine whether changes were made in either structure or content or both.
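As a rough illustration of how column-level differences roll up into a table-level summary, the following Python sketch sorts changed fields into structure and content groups and counts the affected columns. The field names (`defined_type`, `inferred_type`, `invalid_count`, `pct_invalid`) and the sample values are assumptions chosen for the example; they are not Information Analyzer's internal names.

```python
# Hypothetical per-column results; field names are illustrative only.
STRUCTURE_FIELDS = ("defined_type", "inferred_type")
CONTENT_FIELDS = ("invalid_count", "pct_invalid")

baseline = {
    "VENDORCODE": {"defined_type": "CHAR", "inferred_type": "CHAR",
                   "invalid_count": 0, "pct_invalid": 0.0},
}
current = {
    "VENDORCODE": {"defined_type": "CHAR", "inferred_type": "CHAR",
                   "invalid_count": 5, "pct_invalid": 2.5},
}


def column_differences(base_col, cur_col):
    """Split differing fields into structure and content groups."""
    diff = {"structure": [], "content": []}
    for field in STRUCTURE_FIELDS:
        if base_col[field] != cur_col[field]:
            diff["structure"].append(field)
    for field in CONTENT_FIELDS:
        if base_col[field] != cur_col[field]:
            diff["content"].append(field)
    return diff


differences = {
    col: column_differences(baseline[col], current[col]) for col in baseline
}

# Table-level summary: number of columns with at least one difference.
columns_changed = sum(
    1 for d in differences.values() if d["structure"] or d["content"]
)
```

In this sample the structure group stays empty while both content fields are reported, which is the kind of result the differences view lets the analyst read off for each column.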


Checkpoint
1. True or False? Objective of Baseline Analysis is to understand how data structures are changing over time.
2. True or False? Baseline Analysis is done one table at a time.
3. True or False? To view changes in current data analysis versus Baseline, you must rerun the Baseline Analysis against new data values.


Answer the checkpoint questions to test your mastery of the material presented.


Checkpoint solutions
1. True or False? Objective of Baseline Analysis is to understand how data structures are changing over time. True
2. True or False? Baseline Analysis is done one table at a time. False
3. True or False? To view changes in current data analysis versus Baseline, you must rerun the Baseline Analysis against new data values. False


Answers to the checkpoint questions are provided on the slide.


Demonstration 1 Baseline analysis

• Create baseline and review differences between baseline and current analysis


Demonstration 1: Baseline analysis

Purpose: This demonstration shows how to run a baseline analysis and use it to compare against a new current analysis. You can use IA to detect changes in data structure and content over time. To fully understand how this is done you need to first save your current analysis and freeze it in time by making it a baseline.

Task 1. Set Baseline and View Baseline Analysis.
1. Launch Information Server Console, login as student/student.
2. Open the Chemco project.
3. Use the pillar INVESTIGATE menu to select Baseline Analysis.
4. Open the chemcoseq data source tree and select the VENDOR table.
5. Click the Set Baseline task under the Tasks pane and then click OK.
6. Set the baseline for the Current Analysis. A Set Baseline screen will appear.
7. Click the Close button. Your baseline for the VENDOR table is now stored. You will next return to column analysis and make a change to the VENDORCODE column; this will not change the baseline, but will change the current analysis.
8. Click Column Analysis under the INVESTIGATE pillar and expand the rows down to the table level.
9. Click the VENDORCODE field of the VENDOR table and then click Open Column Analysis in the Tasks pane.
10. Click the View Details button and then click the Domain & Completeness tab.
11. Change the Status field of the [NULL] data value to valid (you set this to invalid in an earlier demonstration). If you did not set this in an earlier demonstration then change the validity to invalid. The objective is to make a change to a table AFTER you set a baseline.
12. Save your change.
13. Return to the Baseline Analysis tab.
14. Select the VENDOR table and then click the View Baseline Analysis task. The Pick Analysis Summary window will appear.


15. Click the Current Analysis radio button, and then click OK. A summary screen of red-flagged differences will be displayed.
16. Click the Baseline Differences from the Title options in the left corner of the window.
17. Ensure that the VENDORCODE column is selected.
18. Under the Differences - Structure tab, note there are no differences.
19. Click the Content tab. Differences are flagged in red for Invalid and % Invalid.
20. Click Close and then close Information Server Console.

Results: This demonstration showed how to run a baseline analysis and use it to compare against a new current analysis.


Unit summary

• Performed Baseline Analysis


Unit 11

Reporting and publishing results


Information Analyzer v11.5 © Copyright IBM Corporation 2016 Course materials may not be reproduced in whole or in part without the written permission of IBM.


Unit objectives

• Produce Information Analyzer data reports
• Publish Information Analyzer analyses


Communicating the analysis results

• Reporting addresses data:
   – Structure
   – Content
   – Relationships
• Publishing makes a summary of the analysis review available to developers using other components of the Information Server, such as:
   – DataStage
   – QualityStage
   – Information Governance Catalog


Reporting generates the reports from analysis on structure, content, relationships or integrity of data that can be distributed to a wider audience. Publishing Analysis makes the summary of the analysis review, including annotations, directly available to developers or data stewards using other components of the Information Server such as DataStage, QualityStage, or Information Governance Catalog.


Reporting

• Prerequisite:
   – Relevant analyses for specific reports are complete.
• Reports:
   – Select Reports from the Home Pillar menu to generate reports from any of the standard out-of-the-box reporting templates.


Reports are initiated from the pillar menu and do not require an open project. However, your relevant analyses should be completed before publishing related reports.


Reporting: Selecting report types

Reports:
   – Report Templates are grouped by Category and Sub-Category
   – Information Analyzer has 10 Sub-categories of report templates


Report templates are divided into various categories by product – Information Analyzer, DataStage. Within each product category are subcategories.


Reporting: Report model

[Diagram: Report template, Report, Report Output]

Report templates:
• Have report creation parameters
• Have report runtime parameters
• Are defined for each product
• Cannot be defined by users
• Share the same graphical template

Reports:
• Have report runtime parameters
• Can be scheduled
• Can be run once
• Format
• History (replace, keep, expiration)
• Access rights

Report Results:
• HTML/PDF/Word RTF/XML
• Can be added to a favorite folder


Report templates are used to create reports, which in turn can be run and rerun. Each report generates results that can take a variety of formats. Individual report runs can be saved and viewed subsequently.
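The relationship between templates, reports, and results can be pictured with a small Python model. This is a sketch for orientation only: the class names, fields, report name, and run dates are invented for the example and are not part of any Information Server API.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(frozen=True)
class ReportTemplate:
    """Product-supplied definition; frozen because users cannot define their own."""
    name: str
    category: str


@dataclass
class ReportResult:
    """Output of a single run of a report."""
    run_at: datetime
    output_format: str


@dataclass
class Report:
    """A saved report built from a template; it can be run repeatedly."""
    name: str
    template: ReportTemplate
    output_format: str = "HTML"
    history: list = field(default_factory=list)

    def run(self, when):
        """Record one run and keep it in the report's history."""
        result = ReportResult(run_at=when, output_format=self.output_format)
        self.history.append(result)
        return result


template = ReportTemplate("Completeness and Validity Summary", "Column Domain")
report = Report("VENDOR completeness", template)
report.run(datetime(2016, 1, 4))
report.run(datetime(2016, 2, 1))
```

One template can back many reports, and each report accumulates its own run history, which is what makes rerunning and viewing earlier results possible.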


Reports

• Can be saved and these saved definitions can also be viewed directly.
• Create new report:
   – Select the Report Types tab.
   – Expand a Report Type category and select a Report Template.
   – Select New Report task.


Reports can be saved to folders within Information Server and those reports can be viewed from within Information Server. This slide outlines the way to create a new report. Reports that already exist can be viewed by clicking the Saved Reports tab.


Reporting: Creating new reports

• Step 1: Select Sources:
   – Expand the data source tree to the desired tables or columns.
   – Check the tables or columns.
   – Click Next button.
• Step 2: Specify Report Parameters:
   – Change report specific parameters as desired.
   – For example, enter your own Report Description.
   – Click Next button.
• Step 3: Specify Name and Output:
   – Enter a Report Name.
   – Select Output Format (for example, PDF, HTML, …)
   – Select whether to Save Report, Add as Favorite, or View when Complete.
   – Click Finish button.


Three steps are required to create a report - these steps are documented on the left portion of the screen so that the data analyst can keep track of where they are in the report generation process.


Reporting: Running reports

Run Reports:
   – A Saved Report can be Run.
   – Select the Report to Execute.
   – Click Run task.
   – Activity Status for the Report can be checked while in progress.


Reports that have already been created can be rerun; previous report runs can be viewed. In addition, report runs can be versioned and a report history maintained.


Reporting: Viewing reports

• View Reports:
   – An Executed Report can be Viewed
   – Check status under Last Run date
   – Select the Report to View
   – Click View task
   – Reports open based on their output format (for example, HTML, PDF, and so on).
   – Multiple pages of a report are scrollable through the standard browser interface.
   – Through the browser, reports can be saved or emailed.
   – Note: Executed reports are stored in the repository and are delivered to the browser.


Reports are maintained in the XMETA database (the internal database for Information Server). Consequently, they can be viewed by multiple people. Because the reports are in the XMETA database, you’ll need to export them using tools available in your browser.


Reporting: View reports by date

• View Reports by Date:
   – Reports can be browsed by date of execution.
   – Click View by Date task.
   – Select Start and/or End Date for view.
   – Click Apply button to change date range.
   – From this view, you can choose to View or Run that specific report


Reports can be browsed by date. This slide outlines the steps to perform this function.
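The date filter behaves like a range selection in which either bound may be omitted. A minimal Python sketch of that logic follows; the `view_by_date` function, the report names, and the dates are all hypothetical and only mirror the "Start and/or End Date" choice described on the slide.

```python
from datetime import date

# Hypothetical report runs: (report name, execution date).
runs = [
    ("VENDOR completeness", date(2016, 1, 4)),
    ("VENDOR completeness", date(2016, 2, 1)),
    ("CUSTOMER classifications", date(2016, 3, 15)),
]


def view_by_date(runs, start=None, end=None):
    """Keep runs whose date falls within the optional inclusive range."""
    return [
        (name, run_date)
        for name, run_date in runs
        if (start is None or run_date >= start)
        and (end is None or run_date <= end)
    ]


# Only a start date is supplied, so the range is open-ended.
selected = view_by_date(runs, start=date(2016, 2, 1))
```

Leaving both bounds empty returns every run, which corresponds to browsing the full report history.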


Demonstration 1 Reporting

• Create column summary report


Demonstration 1: Reporting

Purpose: This demonstration shows how to use IA functions to produce professional-looking data reports. Reports are the primary way that a Data Analyst communicates with project team members and end users. IA provides an easy way to produce those reports.

Task 1. Creating a Report.
1. Launch Information Server Console, login as student/student and then open the Chemco project.
2. Open Home and then select Reports from the pillar menu.
3. Click the Report Types tab and expand Information Analyzer - Column Domain.
4. Select the Completeness and Validity Summary report template and then click New Report under the Tasks pane.
5. Expand the Chemco tree until you see the individual tables.
6. Click the VENDOR.txt checkbox. This will create a report for all columns in the VENDOR table.
7. Click the Next button.
8. Enter any report comments, set Locale Specification to English (US) and then click Next. The next screen will contain an Output Format drop-down box.
9. Select the HTML option.
10. Check the Save Report checkbox.
11. Click the Finish button. You will then return to the Report templates screen. A job will be submitted and when the report is finished it will display on your screen. If prompted for a user ID and password enter student/student.
12. Click the Saved Reports tab and locate your new report. To create another report, you can click the Run option under the Tasks pane. Another job will be submitted and another report produced. (Presumably, you would have waited until more analysis had been completed on the VENDOR table before running another report.)


Task 2. Create Chemco Project Reports.
1. Recall from the Project Scenario exercise that you should focus on the CUSTOMER, CARRIER, VENDOR, ORD_HDR, ORD_DTL, ITM_MSTR, and UNITCTLG tables. Create and save the following reports for the designated tables; use the Display Notes option and select HTML for the output:
   • Data Classifications, found in the Column Classification category.
   • Domain Analysis Detail, found in the Column Domain category. Note that you have numerous options in the parameters portion of this report.
2. Run and view each report created.
3. Close the Chemco project but leave Information Server Console open for the next demonstration.

Results: This demonstration showed how to use IA functions to produce professional-looking data reports.


Publish analysis results

• Prerequisite:
   – All Reviews are complete
   – In context of project
• Publish Analysis Results:
   – Select Publish Analysis Results from Pillar menu to share the information that you have gathered about your data with the rest of the suite.
• Select data object.
• Select Publish Analysis Results from the Tasks list.
• You can choose to publish Current Analysis, a Checkpoint or a Baseline. You can also include notes here.


Publishing analysis results must be performed in the context of a project and is initiated from the Investigate pillar menu. Select the data object and then select the option to publish analysis results. You’ll be given the option to publish using current analysis, a checkpoint, or a baseline. Very importantly, you can publish with the option to include notes.


View published results from DataStage

• Prerequisite:
   – Data has been published
• To use the published results:
   – Within DataStage, select Repository/Metadata Sharing/Management from the menu bar


To import the published analysis results, log on to DataStage Designer. Then select Repository > Metadata Sharing > Management from the menu bar.


Create DataStage table definition

• Published Analysis Results:
   – Drill down through the structure until you find what you want to work with, and then click Create Table Definition from shared Table.
   – The tables are now created in the chosen category.


Use the function Repository > Metadata Sharing > Create Table Definition from shared Table. After this has been completed, the table definitions will be created in the chosen category.


View published information: Table level

Published Analysis Results

• Includes analysis results and table level notes


Analytical results are at two levels:
• Table
• Column

To view the table level analysis results and notes, double-click the table definition and then click the Analytical Information tab. Note that analysis results can be found in the Summary pane and notes can be viewed in the Notes pane. Notes include heading, status, and action as well as detailed comments. Use the scroll bar located at the right to view all relevant analysis results.

Columns that have been designated as primary keys will simply be referred to in the analytical information tab; they will not be designated as primary keys in the table’s column list.


View published information: Column level

Published Analysis Results

• Includes analysis results and column level notes


Column level notes can be viewed in a column’s extended editor. This editor also contains an analytical information tab. Results of Information Analyzer analysis are viewed in the Summary pane and attached notes are viewed in the Notes pane.


Exporting DDL

Submit job after preview

Data Definition Language (DDL) is used by database systems to create table structures. The DDL exported by Information Analyzer can be migrated to a database management system for further processing.
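To make the idea of DDL concrete, the Python sketch below builds a generic CREATE TABLE statement from a list of column definitions. It does not reproduce the DDL that Information Analyzer actually exports; the `build_create_table` helper, the `VENDNAME` column, and the data types and lengths are assumptions for illustration.

```python
# Hypothetical column definitions: (name, SQL type, nullable).
columns = [
    ("VENDORCODE", "CHAR(6)", False),
    ("VENDNAME", "VARCHAR(40)", True),
]


def build_create_table(table, columns):
    """Return a generic CREATE TABLE statement for the given columns."""
    lines = []
    for name, sql_type, nullable in columns:
        null_clause = "" if nullable else " NOT NULL"
        lines.append(f"    {name} {sql_type}{null_clause}")
    return f"CREATE TABLE {table} (\n" + ",\n".join(lines) + "\n);"


ddl = build_create_table("VENDOR", columns)
print(ddl)
```

A statement of this shape is what a database administrator would review and adapt before running it against the target database management system.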


Export a reference table

• Prerequisite:
   – All Reviews are complete
• Select Table Management from Pillar menu
• Select the Reference table name that you created in Column Analysis, and then select Open from the Tasks menu
• Click the Export button to create the file


During column analysis you created numerous reference tables. These reference tables were created in the Information Analyzer repository. To use these reference tables, it is necessary to export the data to a location that can be accessed by the ETL developer.


Checkpoint
1. True or False? Reports are generated using a template.
2. True or False? Report runs can be scheduled.
3. True or False? Publishing can include Notes.


Answer the checkpoint questions to test your mastery of the material presented in this unit.


Checkpoint solutions
1. True or False? Reports are generated using a template. True
2. True or False? Report runs can be scheduled. True
3. True or False? Publishing can include Notes. True


Answers to the checkpoint questions are provided on the slide.


Demonstration 2 Publishing results

• Publish analysis results and import table definition into DataStage


Demonstration 2: Publishing results

Purpose: This demonstration shows how to publish analysis results in such a way that other Information Server components can view some of the results. The results of your IA analysis are important to ETL developers; they need to understand the metadata findings you have assembled. IA provides a good way to get this information to developers and this includes your data notes.

Task 1. Publish Analysis Summary.
1. Open project Chemco.
2. Select Publish Analysis Results from the Investigate Pillar menu.
3. Select the VENDOR table.
4. Click the Publish Analysis Results task under the Tasks pane.
5. Click the Current Analysis radio button and the Include Notes checkbox.
6. Click OK.
7. Click the View Analysis Results task.
8. Review the details.
9. Click the Publish Results Summary button.
10. Select the Current Analysis option, Include Notes, and then click OK.


11. Log on to DataStage Designer by double-clicking the Designer Client icon, login as student/student and leave the default DataStage project selected in the Project list.
12. Click Cancel to close the New dialog box (if necessary).
13. Click the Repository menu.
14. Select Metadata Sharing and then select Create Table Definition from shared Table.
15. Select Yes if prompted to overwrite.


16. Drill down to the VENDOR table, select it, and then click the Create button. You can see all of the Chemco tables, even ones that were not published. Recall that the metadata import process in one Information Server component can make that metadata available to other Information Server components. But only published tables will have analysis results attached (you can't see the results from this screen).
17. Locate the VENDOR table in the repository tree; it will appear under the Table Definitions branch.
18. Double-click the VENDOR table.
19. Click the Analytical Information tab. Notes are contained in the lower pane; these are the table level notes.
20. Click the Columns tab.
21. Double-click the number just to the left of VENDORCODE (2 in this example). This will place you in the extended metadata editor for the VENDORCODE column.
22. Click the Analytical Information tab. Analysis results for the VENDOR table are now available to ETL developers from the DataStage environment.


23. Click Close, click OK, and then select No if prompted to save the table definition.
24. Close InfoSphere DataStage and QualityStage Designer.
25. In Information Server Console, close the Chemco project.

Results: This demonstration showed how to publish analysis results in such a way that other Information Server components can view some of the results.


Demonstration 3 Export reference tables

• Export mapping and validity reference tables

Reporting and publishing results

© Copyright IBM Corporation 2016


Demonstration 3: Exporting reference tables

Purpose: This demonstration shows how to export reference tables from IA to an area that can be accessed by other team members – particularly ETL developers. Most tables in IA are contained in the IADB; this includes any reference tables you created. Often these reference tables are useful to ETL developers. IA provides a mechanism to export these reference tables out of the IADB and place the data in a location that ETL developers can access.

Task 1. Export a reference table for invalid values.

In an earlier demonstration, you created two reference tables - CUSTIDMAP and VALCREDCODE. These tables were created in the IADB and should now be exported to disk where ETL developers can easily access them.

1. Open the Chemco project.
2. Using the Pillar menu, click Investigate and then select Table Management.
3. Locate your reference tables.
   Information Analyzer creates numerous tables during column analysis. Many of these tables are useful and can be exported for ETL developers. Although the actual names of many Information Analyzer tables are somewhat mysterious, you can identify which columns and tables were analyzed to produce the results. In this case, however, you are only currently interested in the CUSTIDMAP and VALCREDCODE tables.
4. Select the CUSTIDMAP table and click the Open option in the Tasks pane.
5. Click the Export button located in the lower-right portion of your screen.
6. Use the Browse button to navigate to a folder on your hard drive, name the file CUSTIDMAP, and select the csv file type. Select the delimited file option with | delimiter, ensure the Include Column Headers checkbox is selected, and then click the OK button.
7. Click OK to complete the export.
8. Click Close.
9. Use this method to also export the VALCREDCODE table.
10. Click Close and close all open windows.

Results: This demonstration showed how to export reference tables from IA to an area that can be accessed by other team members – particularly ETL developers.


Unit summary

• Produce Information Analyzer data reports
• Publish Information Analyzer analyses


Data rules and metrics


Information Analyzer v11.5 © Copyright IBM Corporation 2016 Course materials may not be reproduced in whole or in part without the written permission of IBM.

This unit covers data rules and metrics in Information Analyzer.

Unit 12 Data rules and metrics


Unit objectives

• Build and test data rules
• Build data metrics


Overview: Data Rules, Rule Sets, Metrics - Information Analyzer


You will begin by reviewing an overview of the topics covered in this unit.


What is a data rule?

• Rules are DATA rules, not Business Rules, Process Rules, and so on
• These rules are specifically designed to perform operations against DATA. (Business Rules are more targeted for process operations)
• These are technical rule definitions, used to evaluate data environments
• Three basic types of DATA rules:
  − Data Transformation Rules:
    − Simple rules (ColA = ColB)
    − Advanced rules (ColA = UPCASE(ColB))
  − Data Quality Rules:
    − Assertion rules (IF ColA is NULL THEN FAIL)


A data rule is a rule that addresses data conditions; data rules are neither business rules nor process rules. Data rules will be used to evaluate and transform data environments. Three examples are given on the slide.
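The three example rules on the slide can be illustrated outside of Information Analyzer. The following is only a minimal Python sketch, not Information Analyzer rule syntax; the function names, the `evaluate` helper, and the sample records are assumptions made up for illustration.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, Optional[str]]


def simple_rule(rec: Record) -> bool:
    # Simple transformation rule from the slide: ColA = ColB
    return rec["ColA"] == rec["ColB"]


def advanced_rule(rec: Record) -> bool:
    # Advanced transformation rule from the slide: ColA = UPCASE(ColB)
    col_b = rec["ColB"]
    return col_b is not None and rec["ColA"] == col_b.upper()


def assertion_rule(rec: Record) -> bool:
    # Data quality (assertion) rule from the slide: IF ColA is NULL THEN FAIL
    return rec["ColA"] is not None


def evaluate(rule: Callable[[Record], bool], records: List[Record]) -> Dict[str, int]:
    # Count how many records pass and how many fail a single rule
    passed = sum(1 for rec in records if rule(rec))
    return {"passed": passed, "failed": len(records) - passed}


# Hypothetical sample records, invented for illustration only
sample_records: List[Record] = [
    {"ColA": "ABC", "ColB": "ABC"},
    {"ColA": "ABC", "ColB": "abc"},
    {"ColA": None, "ColB": "abc"},
]

for rule in (simple_rule, advanced_rule, assertion_rule):
    print(rule.__name__, evaluate(rule, sample_records))
```

Each rule reduces to a pass/fail test per record, which is what makes the statistics and exception output discussed later in this unit possible.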


Some guiding concepts

• Logical data rule definition: can build without requiring physical data knowledge
• Reusable: can be applied to multiple data sources
• Shared: can be utilized/referenced by other Information Server components (for example, Information Governance Catalog, FastTrack)
• Quickly evaluated: can be tested interactively
• Flexible output: can look at statistics and exceptions
• Historical: can capture and retain execution results over time
• Organized: can place within relevant categories/folders
• Deployable: can transfer the rule to another environment (for example, production)
• Auditable: can identify who modified a rule and when


This slide documents some of the guidelines that will be used to create data rules, specifically:
• Create logical data rule definitions that can be used to build data rules without any knowledge of the physical data structure
• Data rules that are reusable in that they can be applied to multiple data sources
• Data rules that can be shared; that is, they can be utilized by other Information Server components such as Information Governance Catalog and FastTrack
• Ability to quickly evaluate the rules by testing them interactively
• Rules that provide output in the form of statistics and exceptions
• Ability to capture and retain execution results over time to give a historical perspective
• Ability to organize these rules into relevant categories and folders
• Rules that are deployable in that we can transfer the rule to another environment
• All processes to be auditable in that we can identify who modified the rule and when


Components

• Information Analyzer uses the internal concept of a Quality Component
• Component Types include:
  − Rule definition - logical representation of a Data Rule
  − Executable rule - physical representation of a Data Rule
  − Metric - key indicator that represents a standard evaluation value or calculation
• All Quality Components support the ability to be secured through the use of Access Control Lists (ACLs)
• All Quality Components provide the ability to capture execution histories as well as Audit Histories


Data rules, as an Information Analyzer concept, have several representations. One is the rule definition, which is a logical representation of the data rule. Another is the executable rule, which can run against real data. The third is the metric, which represents a standard evaluation value or calculation. All of these representations can be thought of as quality components that belong to Information Analyzer.


Organized by category

• Key concept within Information Analyzer project is ability to categorize information into multiple folders

• Provides mechanism to relate common artifacts into common collections

• Items can belong to any category and can belong to multiple categories.

• Categories are hierarchical


Data rules can be organized into categories. These categories provide a mechanism to relate data collections to their associated rules. A rule can belong to any category and can also belong to multiple categories. Additionally, categories are hierarchical.


Category view


This slide shows data quality components (data rules, metrics, etc.) organized into a set of categories. Some of the categories are hierarchical. Categories make a flexible framework whereby you can group objects that you think are related to one another.


Data rule definition: Abstract rules

• Represents the abstract concept of a rule
• Can contain simple rules or complex rules
• Can use:
  − Terms from Information Governance Catalog
  − Functions


You start the process by creating a data rule definition; this represents the abstract concept of a rule. Rules can be either simple or complex, can use terms from Information Governance Catalog, and can invoke functions.


Logical rules

• Are called data rule definitions
• Represent a logical expression
• Include source or reference columns, but those columns can be logical or physical:
  − Physical sources: Standard metadata imported via IA etc.
  − Logical sources: Data models, and so on, available in common model
  − Logical placeholders: User-created/defined words that will be mapped to one or many physical sources
• Are like templates and can be associated with one or many executable Data Rules


Data rule definitions are the logical form of a rule. Although they include the source or reference columns, those columns can be either logical or physical. For physical columns you need to have already imported standard metadata into Information Analyzer. Logical representations of data can be in the form of data models or represented by placeholders, which are user-defined words that will eventually be mapped to a physical source of data. Logical rules are very much like templates that can later generate one or many executable data rules.


Executable rules

• Are called data rules
• Data Rules represent an expression that can be processed/executed
• Require that a data rule definition be bound to a specific physical data source (for example, a source or reference column must be directly linked to one specific physical column within a specific table that has a data connection)
• Are the objects that are actually run by a user and will produce a specific output


Executable rules are data rules that have been generated from a data rule definition and are bound to specific data. Data rules are the objects that are run by a user and that produce specific output.
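The split between a logical rule definition and a bound, executable data rule can be pictured as a template plus a binding. The sketch below is an assumption-laden Python illustration, not how Information Analyzer stores rules: the class names, the `bind` and `run` methods, and the sample rows are hypothetical. VENDORCODE is borrowed from the VENDOR table used in the demonstrations, and CUSTID is an invented column name.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class RuleDefinition:
    # Logical rule: names placeholders only, knows nothing about physical columns
    name: str
    placeholders: Tuple[str, ...]
    logic: Callable[..., bool]

    def bind(self, bindings: Dict[str, str]) -> "ExecutableRule":
        # Every placeholder must be linked to one physical column before execution
        missing = [p for p in self.placeholders if p not in bindings]
        if missing:
            raise ValueError(f"unbound placeholders: {missing}")
        return ExecutableRule(self, dict(bindings))


@dataclass
class ExecutableRule:
    # Executable rule: a definition plus a placeholder-to-column binding
    definition: RuleDefinition
    bindings: Dict[str, str]

    def run(self, rows: List[Dict[str, object]]) -> Dict[str, int]:
        passed = 0
        for row in rows:
            values = {p: row[self.bindings[p]] for p in self.definition.placeholders}
            if self.definition.logic(**values):
                passed += 1
        return {"passed": passed, "failed": len(rows) - passed}


# One logical definition ...
code_exists = RuleDefinition(
    name="code_exists",
    placeholders=("code",),
    logic=lambda code: code is not None and str(code).strip() != "",
)

# ... reused against two hypothetical physical sources
vendor_rows = [{"VENDORCODE": "V001"}, {"VENDORCODE": ""}, {"VENDORCODE": None}]
customer_rows = [{"CUSTID": "C1"}, {"CUSTID": "C2"}]

vendor_result = code_exists.bind({"code": "VENDORCODE"}).run(vendor_rows)
customer_result = code_exists.bind({"code": "CUSTID"}).run(customer_rows)
print(vendor_result, customer_result)
```

The same definition produces two executable rules, which mirrors the "template" idea: the logic is written once and only the binding changes per data source.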


Predefined rules

• Are a large collection of rules, spanning industries, for perhaps the most common rule checks
• Building blocks for those that want to get started quickly with Data Quality but don’t want to spend a lot of time learning how to define rules
• Appear under 'Published Rules' in the Data Quality workspace
• Hierarchically organized - drill down to the rule you want and copy to your project
• Generate an executable and bind to the data you want to run the rule against


When Information Server 11.5 is installed (assuming you install Information Analyzer), a set of predefined data rule definitions is installed. These can be used as templates for making other data rule definitions. The predefined rules are organized into categories.


IBM supplied predefined rules

• Loaded at product installation (IA 9.1)
• Can be loaded post installation (IA 8.7)


These rules can be found in the project's published rules area and are divided into categories as shown on this graphic.


Benchmarks

• All quality components can establish a benchmark for evaluation
• For rules, benchmarks can be established against:
  − Base statistics: Number passed/failed
  − Associative statistics: Composite evaluation of rule sets
  − Historical statistics: Evaluation against prior runs


Benchmarks are thresholds that are either met or not met. They are cutoffs that define whether or not the data quality being tested by the rule has been met. Benchmarks determine whether or not the rule run is flagged red or green.
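As a rough illustration of a benchmark on base statistics, the sketch below compares the percentage of records that passed against a minimum threshold and returns a red or green flag. The function name, the percentage-based threshold, and the sample numbers are assumptions for illustration; Information Analyzer lets you choose the benchmark type in the product itself.

```python
def benchmark_flag(passed: int, total: int, min_pass_pct: float) -> str:
    # Compare the percentage of records that passed with the benchmark threshold
    if total <= 0:
        raise ValueError("total must be positive")
    pass_pct = 100.0 * passed / total
    return "green" if pass_pct >= min_pass_pct else "red"


# Hypothetical runs of one rule against a 95% benchmark
print(benchmark_flag(passed=98, total=100, min_pass_pct=95.0))
print(benchmark_flag(passed=90, total=100, min_pass_pct=95.0))
```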


Rule versus rule set

• A Rule is a single object that generates a single pass/fail type statistic
• Rules generate counts of exceptions, details of exceptions, and user-defined output results
• A rule set is a collection of rules that are executed together as a single unit and generate several levels of statistics
  − Rule sets generate:
    − Rule level exceptions
    − Record level statistics (for example, how many rules did a specific record break)
• Rule sets expand beyond single rule context


Rules can be grouped together to form rule sets. The set can then be executed and it will in turn execute all the rules that belong to that rule set. In this way rule sets expand the concept of a single data rule.
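The two levels of statistics that a rule set produces can be sketched in a few lines of Python. This is an illustrative model only; the `run_rule_set` function, the rule names, and the sample records are hypothetical and do not reflect Information Analyzer internals.

```python
from typing import Callable, Dict, List, Optional, Tuple

Record = Dict[str, Optional[str]]


def run_rule_set(
    rules: Dict[str, Callable[[Record], bool]], records: List[Record]
) -> Tuple[Dict[str, int], List[int]]:
    # Rule-level statistic: number of records that failed each rule
    rule_failures = {name: 0 for name in rules}
    # Record-level statistic: number of rules each record broke
    rules_broken_per_record: List[int] = []
    for rec in records:
        broken = [name for name, rule in rules.items() if not rule(rec)]
        for name in broken:
            rule_failures[name] += 1
        rules_broken_per_record.append(len(broken))
    return rule_failures, rules_broken_per_record


# Hypothetical rules and records, for illustration only
rule_set = {
    "code_present": lambda r: r["code"] is not None,
    "name_present": lambda r: bool(r["name"]),
}
records: List[Record] = [
    {"code": "V1", "name": "Acme"},
    {"code": None, "name": "Beta"},
    {"code": None, "name": ""},
]

rule_failures, rules_broken = run_rule_set(rule_set, records)
print(rule_failures)
print(rules_broken)
```

Running the rules together in one pass is what makes the record-level view possible: a single rule cannot tell you that the third record broke every rule in the set.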


Rules and rule set execution results

• Captured and persisted (can produce output tables)
  − Maintained in either internal or user-named tables
  − Columns to be captured are set by the user
• Can be compared to historical runs
• Can be trended over time
• Can be graphically displayed via Dashboards and Charts
• Can be used in calculations to provide aggregate scoring (Metrics) or compared against standard threshold values (Benchmarks)


Results produced by rules and rule sets can be captured and persisted; the results are maintained in either internal or user-named tables and the columns to be captured are determined by the user. Results can be trended over time and compared to historical runs. They can also be graphically displayed via the Information Analyzer dashboard and charts. The results can also be used in calculations to provide an aggregate scoring, called metrics, or compared against standard threshold values known as benchmarks.


User-Named rule output tables - overview

• Normally rule results tables are stored in the IADB with unique system-generated table names; those tables are immediately available for use in ‘viewing results’ for rules.
• User-named output tables provide for:
  − Creating a table with the name specified by you
  − Appending data to an existing user-defined table
  − Sharing of tables between rules (so multiple rules can update the same table)
  − Auto-import of metadata and auto-registration of output tables so the output of one rule can easily be the input to another


Data rules can produce output tables. Normally these tables are named with a unique system-generated name; the contents of these tables are immediately available for viewing in the "View Results" tab. Another option available to the user is that of user-named output tables. The names of these tables are determined by the user and the user can also determine several other characteristics, such as appending data to an existing table. Tables defined in this way can also be set up to automatically import and register to the Information Analyzer project.


User-Named rule output tables - defining

• Specified in the 'Output Table' section of a rule’s 'Bindings and Output'
• Three options:
  − System Table Only
  − Simple User-Named Table
  − Advanced User-Named Table
• The default is 'System Table Only', which generates system-defined table names


This graphic shows the Information Analyzer location (Bindings and Output tab > Output Table) where user-named tables are specified. This option is specified on the data rule, not the data rule definition.


User-Named rule output tables - simple

• A ‘Simple User-Named Table’ can be used by a single rule only.
• ‘Simple’ user-named tables are views backed by system-generated output tables.
• The user can:
  − Provide the name of the table (which gets stored in the schema of the IADB user)
  − Choose to recreate (overwrite) the table every time or to append
  − Indicate how many runs can be appended
  − Indicate if you want the output table auto-registered to the project so it can be fed into additional rules


Simple user-named tables can only be specified by a single rule; if you try to define one on multiple rules, Information Analyzer will give you an error message.


User-Named rule output tables - advanced

• An 'Advanced User-Named Table' can be shared between multiple rules.
• To be shared, the output schema for the various rules must match.
• The user can:
  − Provide the name of the table (which gets stored in the schema of the IADB user)
  − Choose to recreate (overwrite) the table every time or to append
  − Indicate if you want the output table auto-registered to the project so it can be fed into additional rules


Advanced user-named tables can be used in multiple rules. Auto-registration will be discussed on later slides.


User-Named output tables - auto-registration

• Auto-Registration (and Auto-Import) is the ability to automatically import the table’s metadata and register it to a project.
• Can only be done when the IADB has been defined as a data source
• Only used if you selected the 'Include in Project Sources' checkbox when defining the user-named table
• This can be done either at the global or project level - in the 'Analysis Database' tab of either
• The username used in the Analysis Database has to match what was used for the data source definition.


By specifying auto-registration you can automatically use the user-named table in your IA project; it will also automatically import the metadata for the data. However, some setup steps are required before creating the data rule that creates the user-named table:
• Define the IADB as a data source
• Specify certain options in Home > Configuration > Analysis Settings

If you set this at the global level it will automatically be set for new projects, but will not affect existing projects; you would need to go to each existing project’s properties and adjust the analysis settings there.


Define the IADB as a source

Home > Configuration > Sources


This screen is the same screen used to create the chemcoseq data source for the lab demonstrations. Note, however, that the IADB itself is being defined as a data source using the ODBC connector.


Set IADB as a project data source

Home > Configuration > Analysis Settings > Analysis Database


This screen is found at the project properties level on the Analysis Database tab. This screen will not appear as shown unless the IADB has already been defined as a data source.


Select option on rule bindings


This screen was shown in an earlier slide but not when the IADB had been set up to allow auto-registration. You can now see that the ‘Include in Project Data Sources’ checkbox is active and can be specified for a user-named table (not yet named in this screenshot).


Purging output tables - manual method

• Over time a large number of tables build up
• Need an easy way to purge the old tables so the database doesn’t fill up
• 'Purging' feature accomplishes this
• Can be done with the GUI
• Many more command line interface (CLI) options are available, documented here: http://www-01.ibm.com/support/docview.wss?uid=swg21593395


Output tables can accumulate over time and consume database space. Two manual methods at the project level are documented on this slide.


Purging output tables - global solution

• All data quality output tables can be manually purged from a project
• To do this in the UI, you can choose the File > Purge option
• To do the same from the CLI, use the IAAdmin command with the -deleteOutputTable option:
  IAAdmin -user -password -host -port -deleteOutputTable -projectName "*" -ruleName "*" -executionID "*"


The methods documented on the slide are manual but are done above the project level; in other words, they span projects.


Purging output tables - automatic method

• Output tables can be purged automatically through settings in the project’s properties.
• Tables can be purged automatically based on age or number of runs
• From the UI go to the project’s properties and make changes in the Details tab


The automatic purging method can be accomplished using settings in the project’s properties area. Use the property settings shown on this graphic.


Purging output tables - per rule

• Output tables can be purged on a per rule basis as well
• In the UI, go to the Data Quality workspace, select a rule and choose the ‘Purge Output Tables’ task
• To do the same from the CLI, use the IAAdmin command with the -deleteOutputTable option:
  IAAdmin -user -password -host -port -deleteOutputTable -projectName "projectName" -ruleName "ruleName" (-executionID "*" | -olderThanNWeeks 2 | etc)


Lastly, output tables can be purged by rule. Two methods are documented – one is done using the GUI and the other is performed using the CLI.
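If the per-rule purge needs to be scripted, the command line shown on the slide can be assembled programmatically. The following Python sketch only builds the argument list and does not run anything. The helper name `build_purge_command` and all argument values (including the port number and rule name) are hypothetical; the flags are copied from the slide and should be checked against the IAAdmin documentation for your installation before use.

```python
from typing import List, Optional


def build_purge_command(
    user: str,
    password: str,
    host: str,
    port: int,
    project_name: str,
    rule_name: str,
    older_than_weeks: Optional[int] = None,
) -> List[str]:
    # Flags below are taken from the slide; they are not verified here
    command = [
        "IAAdmin",
        "-user", user,
        "-password", password,
        "-host", host,
        "-port", str(port),
        "-deleteOutputTable",
        "-projectName", project_name,
        "-ruleName", rule_name,
    ]
    # The slide shows these selectors as alternatives: pick exactly one
    if older_than_weeks is None:
        command += ["-executionID", "*"]
    else:
        command += ["-olderThanNWeeks", str(older_than_weeks)]
    return command


# Hypothetical values for illustration only
cmd = build_purge_command(
    "student", "student", "localhost", 9443, "Chemco", "MyRule", older_than_weeks=2
)
print(" ".join(cmd))
```

Passing the list to a process launcher rather than concatenating a shell string avoids quoting problems with the "*" wildcard.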


Metrics

• Provide ability to define a quality score value
• Allows user to create mathematical expression comprised of key quality statistics obtained from other rules, rule sets, or other metrics
• Composite scoring can be used to establish key indicators
• Results can be trended and evaluated against a defined benchmark


Another data quality control is the metric. Metrics give you the ability to define a data quality score or value. The scores can be the results of a mathematical equation and can use data rule statistics as input variables. Scores can be stored and trended over time. They can also be evaluated against a defined benchmark.
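One way to picture a metric is as a weighted score computed from the pass/fail statistics of several rules. The Python sketch below is an assumption for illustration, not Information Analyzer metric syntax; the function names, rule names, weights, and statistics are all hypothetical.

```python
from typing import Dict


def percent_passed(stats: Dict[str, int]) -> float:
    # Base statistic for one rule: percentage of records that passed
    total = stats["passed"] + stats["failed"]
    if total == 0:
        raise ValueError("rule has no evaluated records")
    return 100.0 * stats["passed"] / total


def weighted_score(results: Dict[str, Dict[str, int]], weights: Dict[str, float]) -> float:
    # Composite metric: weighted average of the per-rule pass percentages
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(w * percent_passed(results[name]) for name, w in weights.items())
    return weighted_sum / total_weight


# Hypothetical rule results and weights, for illustration only
rule_results = {
    "code_present": {"passed": 90, "failed": 10},
    "name_present": {"passed": 80, "failed": 20},
}
rule_weights = {"code_present": 3.0, "name_present": 1.0}

score = weighted_score(rule_results, rule_weights)
print(score)
```

With these numbers the score is (3 × 90 + 1 × 80) / 4 = 87.5, which could then be stored per run, trended, and compared against a benchmark.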


Metrics guiding concepts

• User-defined: Can establish measures/weights/costs for rules or rule sets
• Flexible: Can select which components to incorporate and review
• Categorical: Can organize components within relevant categories/folders (one or many)
• Historical: Can capture and retain results over time
• Deployable: Can transfer the rule to another environment (for example, production)
• Auditable: Can identify who modified a metric and when


The guidelines used to build metrics are much the same as the guidelines used to create data rules. That is, metrics should be user-defined, flexible, able to be categorized, historical, easily deployed, and subject to auditing.


Summary of Information Analyzer quality controls

• Data rule definitions are logical
• Data rules are generated from a data rule definition and are executable
• Both data rules and data rule definitions:
  − Used to verify data conditions – such as Exists
  − Used to verify relationships within the data
  − Can be organized into categories
  − Can be grouped into rule sets
• Benchmarks can be used in any quality control and will establish performance thresholds
• Metrics assign a score to a specific quality control execution


This slide briefly summarizes some of the points made in this unit regarding data rules.


Checkpoint

• True or False? Data rule definitions do not have a physical binding.
• How are quality components secured?
• What quality components can metrics evaluate?


Please attempt to answer these checkpoint questions to assess your understanding of the material presented.


Checkpoint solutions

• True or False? Data rule definitions do not have a physical binding. True
• How are quality components secured? Through the use of Access Control Lists (ACLs)
• What quality components can metrics evaluate? Rules, rule sets, and other metrics


This slide shows answers to the checkpoint questions.


Data Quality Demonstrations

• Demonstration 1: Data Rules using logical variables
• Demonstration 2: Data Rules using functions
• Demonstration 3: Test a data rule definition
• Demonstration 4: Managing Rule output tables
• Demonstration 5: Rule Sets
• Demonstration 6: Organizing with folders
• Demonstration 7: Metrics
• Demonstration 8: Summary Statistics on My Home


Demonstration 1: Data Rules using logical variables

• Create a new data rule definition
• Create a logical variable
• Construct the rule logic
• Create a rule definition


Demonstration 1: Data Rules using logical variables


Demonstration 1: Data Rules using logical variables

Purpose:
This demonstration shows how to build flexible data rules by using logical variables. You will learn to create logical variables and to develop rule logic that uses them. Logical variables are similar to terms, except that they are not shared outside of Information Analyzer. They represent logical business entities independent of a fixed connection to a physical source. Once created, a logical variable can be contained by a single definition, shared across a project, or shared across all Information Analyzer projects, according to its scope: local, project-wide, or application-wide.

Using project-wide and application-wide logical variables in the construction of data rule definitions provides powerful update capabilities. Because the binding to a physical source is stored within the logical variable, an administrator can modify a single binding and effectively "rewire" all rule definitions that use the logical variable within the designated scope.
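The "rewiring" effect described above can be illustrated outside the product with a small Python sketch. This is not Information Analyzer code or its API: the class names, the `binding` string format, the table and column names, and the second rule name are all hypothetical and chosen only to show why changing one shared binding affects every rule definition that references the variable.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    # The three scopes named in the text.
    LOCAL = "local"
    PROJECT = "project-wide"
    APPLICATION = "application-wide"


@dataclass
class LogicalVariable:
    name: str
    scope: Scope
    binding: str  # hypothetical "TABLE.COLUMN" reference to a physical source


@dataclass
class RuleDefinition:
    name: str
    variable: LogicalVariable  # a shared reference, not a copy of the binding

    def describe(self) -> str:
        return f"{self.name}: checks {self.variable.name} -> {self.variable.binding}"


# One project-wide variable used by two rule definitions (names are illustrative).
division = LogicalVariable("DivisionCode", Scope.PROJECT, "CUSTOMER.DIVCODE")
rules = [
    RuleDefinition("Division Code Numeric", division),
    RuleDefinition("Division Code Exists", division),
]

# Changing the single binding "rewires" every rule that uses the variable.
division.binding = "CUSTOMER_V2.DIVISION_CD"
descriptions = [rule.describe() for rule in rules]
```

In this sketch both descriptions report the new binding after one assignment, which mirrors the update behavior the text attributes to project-wide and application-wide variables.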

Task 1. Construct data rules using logical variables.

You will now begin the process of translating business requirements into data rule definitions. These definitions are the building blocks for investigating data quality issues. A data rule definition contains the rule logic associated with a particular condition. By separating the rule logic from physical data sources, rules can be written from a logical (abstract) perspective. This allows data rule definitions to be used in multiple scenarios by simply associating logical components to physical data. The relationship is stored in a data rule executable.

1. Launch Information Server Console, login as student/student.
2. Open the Chemco project.
3. From the pillar menu, click Develop and then select Data Quality.
4. Click Manage Folders in the Tasks list.
5. Click the Create New button in the lower-right portion of the pane.
6. Name the folder Company and then click OK.
7. Create another folder named Item. Do not create this under the Company folder.
8. Click Close.
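The relationship between a data rule definition and a data rule executable, as introduced at the start of Task 1, can be sketched in Python. This is an illustration only and does not reflect how Information Analyzer stores or runs rules: the classes, the column names `DIVCODE` and `ITEM_DIV`, the sample rows, and the use of `str.isdigit` as a stand-in for a numeric check are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, str]


@dataclass(frozen=True)
class RuleDefinition:
    name: str
    # The logic refers only to logical names, never to physical columns.
    logic: Callable[[Dict[str, str]], bool]


@dataclass(frozen=True)
class RuleExecutable:
    definition: RuleDefinition
    bindings: Dict[str, str]  # logical name -> physical column name

    def run(self, rows: List[Row]) -> List[bool]:
        results = []
        for row in rows:
            # Translate the physical row into the logical view the rule expects.
            logical_view = {
                logical: row[physical]
                for logical, physical in self.bindings.items()
            }
            results.append(self.definition.logic(logical_view))
        return results


# One definition; str.isdigit is a simplified stand-in for "is numeric".
division_numeric = RuleDefinition(
    name="Division Code Numeric",
    logic=lambda v: v["division_code"].isdigit(),
)

# Two hypothetical physical sources with different column names.
company_rows = [{"DIVCODE": "10"}, {"DIVCODE": "A7"}]
item_rows = [{"ITEM_DIV": "22"}, {"ITEM_DIV": "31"}]

# Two executables reuse the same definition with different bindings.
company_exec = RuleExecutable(division_numeric, {"division_code": "DIVCODE"})
item_exec = RuleExecutable(division_numeric, {"division_code": "ITEM_DIV"})

company_results = company_exec.run(company_rows)
item_results = item_exec.run(item_rows)
```

The rule logic is written once against the logical name `division_code`; only the bindings differ between the two executables, which is the reuse the paragraph describes.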


Task 2. Create a new data rule definition.

Create a data rule definition in the Data Quality workspace. The Data Quality workspace contains the data rule definitions, rule set definitions, executable rules, metrics, and monitors for the project.

1. Click New Data Rule Definition in the Tasks list.
2. Enter Division Code Numeric in the Name field.
3. Enter Division codes must be numeric in the Short Description field.
4. Check the Include Benchmark checkbox.
5. Click the Benchmark drop-down list. A list of measures will appear.
6. Select % Not Met (this is the default). This measure represents the percentage of records in the full set of records that do not meet the rule logic.
7. Click the drop-down to the immediate right. A list of operators will appear.
8. Select