Understanding the Predictive Analytics Life Cycle

Wiley & SAS Business Series

The Wiley & SAS Business Series presents books that help senior-level managers with their critical management decisions. Titles in the Wiley & SAS Business Series include:

Analytics in a Big Data World: The Essential Guide to Data Science and its Applications by Bart Baesens
Bank Fraud: Using Technology to Combat Losses by Revathi Subramanian
Big Data Analytics: Turning Big Data into Big Money by Frank Ohlhorst
Big Data, Big Innovation: Enabling Competitive Differentiation through Business Analytics by Evan Stubbs
Business Analytics for Customer Intelligence by Gert Laursen
Business Intelligence Applied: Implementing an Effective Information and Communications Technology Infrastructure by Michael Gendron
Business Intelligence and the Cloud: Strategic Implementation Guide by Michael S. Gendron
Business Transformation: A Roadmap for Maximizing Organizational Insights by Aiman Zeid
Connecting Organizational Silos: Taking Knowledge Flow Management to the Next Level with Social Media by Frank Leistner
Delivering Business Analytics: Practical Guidelines for Best Practice by Evan Stubbs
Demand-Driven Forecasting: A Structured Approach to Forecasting, Second Edition by Charles Chase
Demand-Driven Inventory Optimization and Replenishment: Creating a More Efficient Supply Chain by Robert A. Davis
Developing Human Capital: Using Analytics to Plan and Optimize Your Learning and Development Investments by Gene Pease, Barbara Beresford, and Lew Walker
The Executive’s Guide to Enterprise Social Media Strategy: How Social Networks Are Radically Transforming Your Business by David Thomas and Mike Barlow
Economic and Business Forecasting: Analyzing and Interpreting Econometric Results by John Silvia, Azhar Iqbal, Kaylyn Swankoski, Sarah Watt, and Sam Bullard
Foreign Currency Financial Reporting from Euros to Yen to Yuan: A Guide to Fundamental Concepts and Practical Applications by Robert Rowan
Harness Oil and Gas Big Data with Analytics: Optimize Exploration and Production with Data-Driven Models by Keith Holdaway
Health Analytics: Gaining the Insights to Transform Health Care by Jason Burke
Heuristics in Analytics: A Practical Perspective of What Influences Our Analytical World by Carlos Andre Reis Pinheiro and Fiona McNeill
Human Capital Analytics: How to Harness the Potential of Your Organization’s Greatest Asset by Gene Pease, Boyce Byerly, and Jac Fitz-enz
Implement, Improve, and Expand Your Statewide Longitudinal Data System: Creating a Culture of Data in Education by Jamie McQuiggan and Armistead Sapp
Killer Analytics: Top 20 Metrics Missing from Your Balance Sheet by Mark Brown
Predictive Analytics for Human Resources by Jac Fitz-enz and John Mattox II
Predictive Business Analytics: Forward-Looking Capabilities to Improve Business Performance by Lawrence Maisel and Gary Cokins
Retail Analytics: The Secret Weapon by Emmett Cox
Social Network Analysis in Telecommunications by Carlos Andre Reis Pinheiro
Statistical Thinking: Improving Business Performance, Second Edition by Roger W. Hoerl and Ronald D. Snee
Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics by Bill Franks
Too Big to Ignore: The Business Case for Big Data by Phil Simon
Using Big Data Analytics: Turning Big Data into Big Money by Jared Dean
The Value of Business Analytics: Identifying the Path to Profitability by Evan Stubbs
The Visual Organization: Data Visualization, Big Data, and the Quest for Better Decisions by Phil Simon
Win with Advanced Business Analytics: Creating Business Value from Your Data by Jean Paul Isson and Jesse Harriott

For more information on any of the above titles, please visit www.wiley.com.

Understanding the Predictive Analytics Life Cycle

Alberto Cordoba

Cover image: © iStock.com/oliopi
Cover design: Wiley

Copyright © 2014 by Alberto Cordoba. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600, or on the Web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.

Library of Congress Cataloging-in-Publication Data:
ISBN 978-1-118-86710-5 (Hardcover)
ISBN 978-1-118-93893-5 (ePDF)
ISBN 978-1-118-93892-8 (ePub)

Printed in the United States of America
10 9 8 7 6 5 4 3 2 1

Dedicated to my loving family

Contents

Foreword  xi
Preface  xiii
Acknowledgments  xv

Chapter 1  Problem Identification and Definition  1
  Importance of Clear Business Objectives  4
  Office Politics  8
  Note  13

Chapter 2  Design and Build  15
  Managing Phase  16
  Planning Phase  18
  Delivery Phase  19
  Notes  32

Chapter 3  Data Acquisition  33
  Data: The Fuel for Analytics  36
  A Data Scientist’s Job  41
  Notes  53

Chapter 4  Exploration and Reporting  55
  Visualization  57
  Cloud Reporting  61

Chapter 5  Modeling  69
  Churn Model  71
  Risk Scoring Model  77
  Notes  99

Chapter 6  Actionable Analytics  101
  Digital Asset Management  104
  Social Media  104

Chapter 7  Feedback  129
  What the Different Software Components Should Do  132
  Note  148

Conclusion  149
Appendix: Useful Questions  155
Bibliography  209
About the Author  211
Index  213

Foreword

With this book, Al has made an astonishing contribution to the growing body of knowledge about business analytics. The book offers an unprecedented look into the gory details of life as a programmer/analyst, BI specialist, researcher, project manager, data scientist, and consultant. It offers examples of problem solving that could only have been achieved by using the progressive power of the information technologies mastered in the 1990s and 2000s. Have there been other books about this topic? Of course, but none has portrayed the human side of this global endeavor with so much enthusiasm and humor before. For the first time, Al personifies the characters that play important roles in the lifecycle of the generation and use of predictive analytics, showing their creative abilities in industries such as banking, megaresorts, mobile operators, healthcare, manufacturing, and retail. These personifications help us all better understand and manage this big and complex process of deriving information from data in today’s increasingly sophisticated race to drive productivity and innovation. This understanding is essential to excel at providing an outstanding customer experience, managing customer churn, and performing data-intensive marketing campaigns. This book has many “novelistic” aspects and is very conversational; the stories make each section more personal and relatable. The “Note Files” in particular will be quite helpful for readers as they present examples of real-life analytics projects. These files can be used as good starting seeds for projects.

Leigh Watts


Preface

People get excited about big data and business analytics. That might sound ridiculous, but I’m not kidding. Over the years, I have traveled everywhere from Brazil to Japan working on analytics, and in every country, I have found a particularly peculiar brand of analytic fanaticism. These analysts are hilarious and downright exciting! I wrote this book to help business and IT professionals understand the predictive analytics lifecycle. A reader can get a sense for the entire predictive cycle and therefore avoid potential risks. Often folks who work in a particular area of the analytics cycle—for example, analysis—have little understanding of another area, such as integration. This situation sometimes creates confusion, poor communication, and delays. This book has seven chapters, which illustrate a complete cycle from idea definition to feedback. In each chapter, I added notes at the end with examples. The first chapter concentrates on the initial stages of defining the problem at hand and how to get the business customer to quantify the overall business value of the project. The note file contains a sample nondisclosure agreement. Chapter 2, the design chapter, focuses on the planning process. It shows how to define various components of the scope, such as what organizations are affected, what to expect, and the type of information required. The note file has a sample project charter for a data warehouse performance management project. Chapter 3, the integration chapter, discusses the process of bringing data together to build a file ready for analysis. This chapter includes two notes: One is a data-quality sample project, and the other is a file with a description of Hadoop and how it works with SAS. Chapter 4, the reporting and visualization chapter, illustrates how reporting and visualization techniques are used to review and make sense of data. The chapter includes a note file with an example of an analytic project focused on
guest loyalty for a cruise ship. Chapter 5, the analysis chapter, presents how to build a couple of analytical models: a churn model and a scorecard. The note file presents concepts on fraud, waste, and abuse analytics. Chapter 6, about actionable analytics, describes how to use results from the predictive modeling in campaigns. There are two note files. The first has a simple assessment to identify gaps in a CRM analytics platform. The second is a sample project of the construction of a predictive analytics framework for a mobile operator. Chapter 7, the feedback chapter, discusses the iterative nature of the predictive analytics cycle and the importance of including feedback in the development of new models. The conclusion chapter provides a high-level view of the entire analytics lifecycle. The appendix contains more than 1,000 questions that can be used to qualify predictive analytics projects or simply to break the ice with both IT and business professionals interested in applied analytics projects. This book is a mix of technical knowledge and business analytics humor. The names and the actions of the companies and employees have been changed in the interest of making the stories funnier and the content more readable. This book is for anyone who wants to gain a better understanding of the development cycle of business analytics in an entertaining way. Follow professionals of the Information Age as they tackle big data in this fascinating collection of case studies in different industries from around the world based on my real-life experience.

Acknowledgments

Since I began working with business analytics in 1985, I have had the good fortune to work with and learn from some of the best minds in the world of data. When I joined SAS in 1993, I began to see the excitement that companies experience when they realize that they have finally found a way to use internal data to better understand both their organization and their customers and better manage their own performance by developing key performance indicators. I am grateful for the many conversations with my SAS colleagues and customers over these many years. These conversations gave me the chance to see predictive analytics in action in different industries and different geographies, and helped me appreciate what predictive analytics contributes to the enhancement of customer experiences worldwide and how it generates value for organizations. I feel fortunate to have this opportunity to say thank you to all of the amazing people who have coached me, including Jim Goodnight, Clive Pearson, Jeff Babcock, Eric Yao, Herbert Kirk, Lee Richardson, David Fender, Alan Spielman, Steve Gammarino, Rajani Nelamangala, Phil Hyatt, Helen-Jean Talbott, Chuck Zebrowski, Barrett Joyner, Leigh Haddon, Andy Bagwell, Jose Carvalho, Mariana Clampett, Andre Boisvert, Carmelina Collado, Monica Grandeze, Marcos Arancibia, Kimio Momose, Carol Forhan, Bill Marder, Tony Pepitone, and Jon Conklin. Many more colleagues have contributed to my professional development, and I am grateful to them. I can’t forget to mention my eldest daughter, Sienna, for her willingness to work with me on the project and her flexibility and insight into the manuscript. I would also like to thank my three younger children for their undying love and support: Ines, Sofia, and Diego. Finally, I want to extend a very special, love-filled thank you to my beautiful wife, Clara Maria.

Al Cordoba

CHAPTER 1

Problem Identification and Definition

How executives focus resources and assess an organization’s readiness for meeting the challenges posed by new business realities

[Figure: The predictive analytics life cycle, with stages for problem identification and definition, design and build, data acquisition, exploration and reporting, analysis, actionable analytics, and feedback; this chapter covers problem identification and definition.]


Recently I met with a pair of business executives at the Gaylord Convention Center near Washington, DC. Two analysts glided their way toward me. I smiled and went in for handshakes, exclaiming “Hello there!” Their names were Zizi and Javier. Both worked for a big corporation right outside of the Beltway in Maryland. I quickly launched into a flurry of business jargon, briskly walking toward the coffee kiosk, mouth running at a hundred million miles per minute. The executives shuffled after me, saying “We are very interested in finding out more about developing a modern analytical system.” I bought a soy latte with an extra espresso shot. As the caffeine kicked in, I started by asking, “What is your firm’s level of analytical maturity?” Javier looked at me and said, “Before we get started, do we have an NDA in place?” A nondisclosure agreement is a document signed to protect both parties. (A sample agreement is presented at the end of this chapter.) “We sure do,” I answered. “Great! So let’s continue.” Javier stammered, “I-I don’t know. I believe that analysis is a portion of the transformation cycle from data to knowledge to wisdom. So, probably the analytical maturity of an enterprise would tell how well it can leverage analysis and close the information gap. I am not sure where I would say our company is exactly.” My eyes met his as I popped a huge sparkly smile. “Everybody knows the four key levels of an analytical framework are. . . .” I waited for a response. Zizi replied, “Infrastructure, functionality, organization, and business, and these levels can be translated into an information evolution model for analytical applications.”1 Javier piped up, “What is the importance of this?” I answered, “Those organizations that try simply to define and implement an advanced analytical solution in one step may end up taking far too long to finish building it and reap its benefits.” Zizi lowered her glasses and continued my thought seamlessly. “And then, most likely, the analytical solution delivered will not meet needs because requirements usually change after an initiative is initiated or because the technology has already changed. We’ve been through that before.”


“Exactly!” I added, “There is an overarching need to build flexibility into contemporary analytical systems. Particularly now that data are growing exponentially and we are faced with big data everywhere. I believe enterprises need to assess the overall maturity of their analytical initiative and aim to add value incrementally rather than use an all-at-once approach. This is very important with the big data challenges. Results and challenges differ depending on the level of analytical maturity. I think the assessment of needs for an analytic platform or workbench should include choosing an appropriate software architecture for analysis and reporting, a hardware environment, a big data integration approach, and, of course, a data model for their structured data, among other things.” They wondered, “Is that enough to ascertain success?” I told it to them straight. “Hey, it’s anybody’s guess, but it increases the probability of success significantly!” Results usually are measured in terms of effective usage of information technology (IT) investments and improved operational efficiency. Challenges primarily occur with IT infrastructure, culture, software technology, and functionality. They looked at each other warily. I tried to reassure them a little bit. “Improved results usually are associated initially with having one version of analysis-derived information, the so-called truth, which improves the management of multiple departments. Some of the organizational challenges begin to take more focus and skills from the project team. Good results are associated with decision-making processes that are better and faster than the competition’s.” I decided not to mention the challenges that often occur at the business level, such as shifting business processes and methodologies to leverage new analytical capabilities for corporate performance management, or changing business goals or objectives based on insight gained. They were too apprehensive. Therefore, I wanted to stick with the most basic and positive aspects of reworking their business objectives. I continued, “As your consultant, I have to ask you: Where is your firm going? Is gut feeling still driving decision making? A successful analytical initiative needs good strategic business objectives.”


They winced at that statement. They knew I was right. Javier shook his head back and forth and sighed.

IMPORTANCE OF CLEAR BUSINESS OBJECTIVES

I patted them on the backs. “Business objectives must drive analytical initiatives and investments. The success of an analytical initiative should be measured by how it affects strategic and operational business objectives—not how many rows of big structured and unstructured data can be loaded into a data framework in six hours or the complexity of a model developed. This is particularly true when we consider the vast amounts of data that most organizations have accumulated and that continue growing.” Obviously, the lack of clearly defined business objectives would make assessing the success or value impact of an analytical initiative impossible. “Do you think that you can use an analytical framework to align IT system initiatives with business objectives and make strategic choices?” I asked the executives. “It could be the best thing that ever happened to you.” I recommended that they conduct a business value evaluation prior to investing in an analytical initiative. “This evaluation will provide a quick and low-cost validation of an analytical project’s proposed direction and deliverables. This evaluation will also bring focus and attention to an analytical initiative within your IT organization and the potential business stakeholders. It can also pinpoint weaknesses and threats to the future project.” Zizi asked, “Do you think the evaluation will start a dialogue between the internal groups like the IT organization and business users that will identify the business objectives?” I smiled and said, “Yes. It will identify how analysis can contribute to the success of business objectives. It will set the scope and size of the project and determine the appropriate investment levels. This is just the beginning. Even if your analytical initiative is already under way, I think that if you take a step back to assess the initiative, you may discover new areas of additional leverage or new risks.”


I continued to urge them to face the bitter truths of today’s analytic realities. “It is important to ask questions to better understand which other business opportunities and objectives should be addressed and funded within the analytical initiative. It may be just as important to identify areas to keep outside of the project or that should come along in a second wave.” I took out my phone, glancing at the time. “Why is your organization embarking on analytical applications and big data insight anyway?” Zizi said, “Al, we need to stay competitive, and this is really exciting. We also have a new executive team with the right approach to data and what we can do with it. Just think where this could lead with your help.” I appreciated that comment. “Thank you. I think we are on the right path. Business value in the real world can be achieved only when you leverage data that are relevant, accurate, timely, consistent, and, most of all, accessible. Most organizations that I have worked with start an advanced analytical project in an effort to drive revenue, increase profitability, optimize certain processes, decrease cost, make better decisions, manage the objectives, minimize risk, and/or improve infrastructure functionality. Does any one of those goals sound like your objectives?” They eagerly nodded. I continued, “It sounds like you are planning to use analytical applications to gain a competitive edge in a highly competitive market. If so, are your specific business objectives well articulated? Do you already have your performance measures defined? How well and often are your key performance indicators (KPIs) measured and analyzed? Do the appropriate internal and external users have access to relevant data and analysis? Can you look at the KPIs and easily drill down for additional data? What would be the impact of new insights derived from increased or improved data access or analysis? What would be the impact of more real-time data and/or advanced predictive analysis? Are executive sponsorship and funding available?” Their heads were spinning, so I recommended an easy first step. “Analyze the strategic and tactical business objectives that will drive this analytical initiative and its funding. These objectives ultimately will define your project success.”


We looked at each other across the table in the atrium of Gaylord’s. An hour had gone by, and they confessed they were nervous. I could not blame them. The assignment to develop an analytical application initiative first requires a readiness test. As obvious as it seems, an assessment of the IT organization and business user skills, levels of analytical activity, and culture will help the enterprise determine the probability of an analytical initiative’s success—before it makes any significant investments. I flashed them a million-dollar grin and encouraged them to feel excited about the upcoming changes. “Before embarking on a big data acquisition adventure and its complementary analytical initiative, an enterprise like yours should complete a self-assessment to determine readiness. You must honestly evaluate your available skills, existing processes, and levels of analytical capabilities and culture so that, before spending considerable sums of money, you understand the challenges ahead and have a way to determine how to proceed and the likelihood of success.” Zizi raised her hand. “To assess potential for analytical success, should we rate the level of engagement on analytics demonstrated by both our IT department and our business user community?” I quickly replied, “Yes! First, you should rate the degree to which the following statements apply to your technical organization: Does IT understand the need for and potential of analytical applications? Does IT have the required skills and resources to support an analytical environment? Is IT taking responsibility for setting up an analytical infrastructure? Does IT act as a catalyst for technical improvements in the enterprise? Is IT respected within the enterprise? Does IT have a history of success?” Javier lifted an eyebrow. “What about the business side? In your experience, typically, do business users understand the need for and potential of analytical applications? Do they have a history of funding and championing analytics initiatives? Are the business users the ones to drive IT to deploy new technology? Do they seek an active partnership from the IT organization? Should the business user community participate in the technology selection and adoption process?” I responded, “Well, you are going to have to do the detail work of answering all of those great questions. I have seen a bunch of different combinations. However, the importance of a readiness assessment is
undeniably clear. Any successful enterprise needs a portfolio of analytical applications to address a broad range of user requirements. But before it can develop that portfolio, the enterprise must determine what appropriate technical infrastructure and development methodologies are already in place, including: a platform to source data (e.g., a data mart, data warehouse, operational data store, multidimensional cubes, massively parallel processing (MPP) databases, big data framework), available data models and business definitions, rules for metadata use and integration, support for real-time use, access to cloud computing resources, and, when appropriate, methodologies for development, deployment, and change management. In addition, software for data management, data exploration, advanced analytics, and campaign management is also typically required.” They looked a little flustered with my tech talk, but I wanted to cover a few more points before lunch, so I quickly continued, “Initially, you should make sure that functionality is sufficient to ensure that an analytical initiative could deliver value. Later on, among many other tasks, you or your consultant team will define user requirements, decide whether to build or buy analytic applications, determine enterprise security and user access levels, assess scalability, and ask your IT counterparts to establish standards that match user types to appropriate tools.” Zizi asked, “Shouldn’t the data be high-quality data to ensure that the analytical initiative delivers the expected value?” I said, “You know the concept ‘garbage in, garbage out’? Definitely, data governance and data consistency are a high priority. For example, you should inventory data sources and means of access, identify data stewards, identify data quality solutions, and define methods to extract and transform data efficiently and correctly.” I paused to think. “Be aware of timing. Many new infrastructure and functionality requirements are identified approximately six months after the initial analytical deployment. This makes an effective implementation methodology critical to ensure all the respective resources and skills are available throughout the system development life cycle to address those new requirements.” Javier asked, “Will this technology assessment help validate technical and cost assumptions?”


I clasped my hands together and nodded slowly. “It will identify whether any critical factors were overlooked. It will spot potential weaknesses in the implementation of a plan.” Zizi gave me a hard, discerning look. “What advanced analytical functionality does our company need, and what is the difference between that functionality and the kind of functionality we are using today?” I beamed at her. “The analytical function can be seen in four main areas: integrate, report, model, and enable. First, ‘integrate’ refers to the ability to collect and organize diverse data and make it ready and accessible for advanced analytical applications. It includes structured data like that generated in operational systems. Typically, it comes from database management systems. It also increasingly includes what is called unstructured data, coming from Web records and social networks, and it is typically very large. Today this data integration area is called information management. Second, we see an area for data exploration, visualization, and reporting. Third, ‘model’ refers to the actual advanced quantitative modeling that takes advantage of statistical or mathematical techniques to gather information out of the data. The fourth area, ‘enable,’ refers to execution: the predictive analytics function is an enabler of other applications like customer service, financial intelligence, or marketing services by improving communication efficiency. I like your question. It is looking toward the future of analytics at your company. I see we are making progress, and I am becoming more confident about your company’s potential for success.”

As you can gather from the previous ideas, it is important to keep in mind that most organizations can derive great benefits when they provide these four functionalities using software as part of an integrated system within the context of an analytical framework. Most traditional analytic software platforms provide extraction and transformation languages, SQL generation, standard reporting, visualization, what-if analysis, alerts, corporate dashboards, statistics, data mining, advanced analysis and forecasting, campaign management, and optimization.
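To make those four areas concrete, here is a minimal illustrative sketch of how they might line up in code. It is not from any project in this book: the file name, column names, and the choice of Python with pandas and scikit-learn are assumptions for illustration only, and the same flow could be built in SAS or any other analytic platform.

    # Illustrative sketch only: integrate -> report -> model -> enable, on hypothetical data.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # 1. Integrate: collect and organize diverse data into one analysis-ready table.
    #    (Here a single hypothetical extract; in practice this is the information
    #    management layer pulling from operational systems, web logs, and so on.)
    customers = pd.read_csv("customers.csv")  # assumed columns are listed below
    features = ["tenure_months", "monthly_spend", "support_calls"]
    target = "churned"  # 1 = customer left, 0 = customer stayed

    # 2. Report and explore: simple profiling before any modeling.
    print(customers[features + [target]].describe())
    print(customers.groupby(target)[features].mean())

    # 3. Model: a basic churn classifier built with a statistical technique.
    X_train, X_test, y_train, y_test = train_test_split(
        customers[features], customers[target], test_size=0.3, random_state=42)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("Holdout accuracy:", model.score(X_test, y_test))

    # 4. Enable: score every customer and hand a target list to a downstream
    #    application, such as a retention campaign in a marketing system.
    customers["churn_score"] = model.predict_proba(customers[features])[:, 1]
    campaign = customers.sort_values("churn_score", ascending=False).head(100)
    campaign.to_csv("retention_campaign_targets.csv", index=False)

Each numbered step corresponds to one of the four areas; in a production framework each would be a separate, governed component rather than a single script.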

OFFICE POLITICS

Javier scribbled a few notes on his iPad. “Considering the types of insight required and the interaction with different types of users, how
will we determine what functionality we need from the software we choose? There are so many choices.” I agreed. “It is very confusing. You need to use a methodology to sort through all the vendors and tools. It is critical to have a clear objective in mind. For an analytical initiative to succeed, different types of users—personas—will need different software tools. Providing casual users like business analysts with analytical tools primarily intended for power users—that is, statistical programmers—will overwhelm the analysts, who most likely do not have the skills or the time to learn about these advanced tools. Likewise, asking power users—that is, programmers—to use simple reporting tools for their analysis work would not work. An inventory of existing tools and user types and their competency levels with a particular software tool will help the organization when the time arrives to select vendors. Most analytical vendors are beginning to deliver enhanced or next-generation products that merge data management, visualization, reporting, analysis, and communications functionality. As a result, a wider range of user types will have access to a broader range of functionality from a single and integrated analytical environment. Don’t forget the access control requirements for users.” Javier asked warily, “Well, that all sounds good, but there must be a catch. What are some of the hidden costs associated with these analytical initiatives? Is that what people call the total cost of ownership?” I told him the truth: “That’s correct! Over time organizations have adopted a large number of disparate and unrelated analytical technologies, adding to tool fragmentation in their organizations. This situation creates problems of support; when something fails, vendors blame each other. It also creates training problems when diverse applications work in different ways for no reason. This situation has also been complicated by the mergers and acquisitions within the analytical vendor community.” Javier perked up. “Yeah! Our organization has been frustrated in our ability to deploy analytical solutions effectively because of the overabundance of unrelated end user technologies from various vendors. Our end users and the IT organization have deployed various analytical tools without much (if any) thought about integration,
future needs, or issues. Unfortunately, they were reacting to the day-to-day pressures.” I totally agreed with him. “A random portfolio of software tools in any organization typically contains products that are current and relevant, older but still-used products, and discontinued and unsupported fads. An organization may also have what is called shelfware, software that has been bought but that nobody uses.” Zizi looked at her watch. “Look, we know that our organization needs to find the right number and mix of tools. To do this, we must stop the proliferation of analytical tools. That will ensure that we can centralize and provide a consistent and manageable analytical environment for our internal users.” “To stop proliferation, you must enforce some standardization around analytical tools and governance for your data. This can be difficult because end users are usually partial to certain tools and resist changing the analytical tools they use and how they operate,” I added. “Let’s not forget politics. Our analytical initiatives could span multiple business and functional groups in our organization. I could see that the politics associated with getting participation, data, and resources from the internal groups could introduce challenges and delays to an analytical initiative,” Zizi said. Javier said thoughtfully, “Let me add something here. Our political and organizational challenges are unique. The politics of who has control related to visibility, information, resources, funding, and technology choices often leads to delays in IT initiatives. I think a cross-functional analytical initiative will fail quickly if it does not have a credible team leadership that anticipates and addresses these challenges.” He obviously had extensive experience with office politics. I tried to sum things up as the minute hand clicked away on Zizi’s watch. “The readiness evaluation should include technology, people, process, and politics. It is a package deal.” We shook hands and agreed to meet via Skype later on that week. I left them sitting at the table reevaluating their company’s subjective and objective levels of analytical well-being, excited about taking the plunge into a new and fascinating analytical mind-set for their company.


NOTE FILE: SAMPLE MUTUAL CONFIDENTIALITY AGREEMENT

THIS CONFIDENTIALITY AGREEMENT (this “Agreement”), effective this DATE, is by and between PARTY1 NAME (“PARTY1”) and PARTY2 NAME (“PARTY2”). Representatives of PARTY1 plan to meet with representatives of Potential Partner to consider establishing a business relationship or providing products or services (collectively, the “Transaction”). In connection with the discussions, each party might disclose certain confidential information to the other. As a condition of disclosing confidential information, the parties have agreed to treat such information as stated in this Agreement.

IN CONSIDERATION of the mutual obligations of the parties, the parties hereby agree as follows:

1. “Confidential Information” Defined. “Confidential Information” means all information disclosed by or on behalf of the disclosing party to or obtained by the receiving party concerning the disclosing party’s business or any product or service developed (or proposed to be developed) by the disclosing party, and whether disclosed in writing, orally, or by inspection. Confidential Information may include, but is not limited to, developer information, pricing, customized products and services, designs, specifications, technical information, protocols, process information, code, software, financial data, business plans, marketing plans, trade secrets, processes, and techniques. Notwithstanding the foregoing, Confidential Information shall not include:

a. Information that at the time of disclosure is in the public domain or is otherwise available to the receiving party other than on a confidential basis;

b. Information that, after disclosure, becomes a part of the public domain by publication or otherwise through no fault of the receiving party or any third party under a confidential agreement with the disclosing party;

c. Information disclosed to the receiving party by a third party not under an obligation of confidentiality to the disclosing party; or

d. Information that is or has been developed by the receiving party (as evidenced by the receiving party’s records) independent of the disclosures by the disclosing party.

2. Covenant of Confidentiality. The receiving party agrees to retain in confidence all Confidential Information. The receiving party further agrees
that it will not use or disclose to any third party, nor permit the use or disclosure to any third party of, any Confidential Information, except that the receiving party may make the Confidential Information available to its directors, officers, employees, and attorneys (collectively, its “Representatives”) who agree to be bound to the terms of this agreement and who reasonably need the information for the receiving party to evaluate the Transaction and, if the parties agree to undertake a Transaction, for the performance of the receiving party’s duties in connection with such Transaction.

3. Covenant to Return Confidential Information. In the event the parties’ discussions terminate, or upon the earlier request of the disclosing party, the receiving party agrees to immediately return to the disclosing party all tangible and intangible documents and files obtained from the disclosing party containing Confidential Information and any materials created or derived from Confidential Information, by whomever or whenever made, without retaining any copies thereof. Once returned, the receiving party agrees to delete all electronic copies of the documents and files from the receiving party’s systems. The receiving party agrees to verify compliance in writing if requested by the disclosing party.

4. Non-Solicitation and Hiring. Neither party shall solicit, hire, or retain directly or indirectly any employee of the other party for a period of 12 months following the later of either the termination date of this agreement or the Transaction termination date.

5. Survival of Confidentiality Obligation. The confidentiality obligation contained herein shall survive the termination of such discussions and negotiations.

6. Remedies for Violation. The parties acknowledge that all Confidential Information disclosed by the disclosing party to the receiving party is significant, confidential, and materially affects the disclosing party’s business and goodwill. The receiving party expressly understands and acknowledges that a violation of this Agreement by the receiving party will cause irreparable injury to the disclosing party, which injury will not be fully compensable by money damages. The parties therefore agree that in the event the receiving party breaches or threatens to breach the covenants contained herein, the disclosing party shall be entitled as a matter of right to a restraining order, an injunction, a decree or decrees of specific performance, or other adequate relief from a court of competent jurisdiction. The provisions of this paragraph shall survive the termination of the parties’ discussions and negotiations pertaining to any Transaction.
In the event of the receiving party’s breach, the receiving party agrees to indemnify the disclosing party for all costs of enforcement, including reasonable court costs and attorney fees, incurred while enforcing the disclosing party’s rights under the agreement.

7. Severability. If any provision of this Agreement is held to be illegal, invalid, or unenforceable under present or future law effective during the term hereof, such provision shall be fully severable and this Agreement shall be construed and enforced as if such illegal, invalid, or unenforceable provision never comprised a part hereof, and the remaining provisions hereof shall remain in full force and effect and shall not be affected by the illegal, invalid, or unenforceable provision or by its severance.

8. Binding Effect. This Agreement shall be binding upon and shall inure to the benefit of the parties hereto and their respective successors and permitted assigns.

IN WITNESS WHEREOF, the parties hereto have executed this Agreement as of the date first above written.

SIGNATURES

PARTY1
By: ______________________________
Title: _____________________________
Chairman, President, or Vice President

PARTY2
By: ______________________________
Title: _____________________________
Chairman, President, or Vice President

NOTE

1. Jim Davis, Gloria J. Miller, and Allan Russell, Information Revolution: Using the Information Evolution Model to Grow Your Business (Hoboken, NJ: John Wiley & Sons, 2006).

CHAPTER 2

Design and Build

How consultants and project managers design and build modern systems for business analytics

[Figure: The predictive analytics life cycle, with stages for problem identification and definition, design and build, data acquisition, exploration and reporting, analysis, actionable analytics, and feedback; this chapter covers design and build.]


A traditional sumo wrestling match is accompanied by a lot of ceremony and a lot less mawashi. “Mawashi” is the name of the stiff loincloth that wraps around the thighs of formidable sumo wrestlers. If a wrestler’s mawashi comes completely undone while wrestling, he automatically loses the match. In other words, the heaviest sumo wrestler in the world, with the best plans and tools in the world, with the best intentions in the world, is not guaranteed success. It is the way you do your job, the execution of your ideas, that will make or break the fate of your matches. Sumo wrestling is more than two massive human beings trying to overpower one another. Sumo wrestling is an ancient Shinto tradition, a ritual process with a strict methodology.

At the Strategic Consulting Projects Meeting for the Asia Pacific SAS Team Meeting held in Tokyo, Japan, in May 2000, I asked myself, “How do we successfully execute on the four phases of a business integration methodology: manage, plan, deliver, and operate? How do I win the match without my mawashi falling off me?” Here is the strange part. The question came back to me 10 years later as vice president of Qualex Consulting, one of the oldest SAS integrators in the world. The four phases of business integration methodology traditionally are manage, plan, deliver, and operate. Let’s consider these four phases closely:

1. Manage. This phase focuses on journey navigation and considers three management disciplines: journey management, program management, and project management.

2. Plan. This phase helps clients define strategies and approaches to achieve competitive advantage and build stakeholder value.

3. Deliver. This phase translates concepts and designs into reality. It includes analysis and design, build and test, and deployment.

4. Operate. This phase focuses on achieving and sustaining the benefits of the new business capability implemented in the delivery phase.

MANAGING PHASE

Tuesday, 8:00 a.m. I stood in front of a classroom of attentive new hires in downtown Tokyo. Mechanical pencils lay across crisp white notebooks
at perfect 45-degree angles, ready to be lifted by nimble fingers at any moment. I loved that my words were being recorded in one of the most beautiful alphabets in the world on beautiful lily petal paper in the year 2000. I winked at the class before I began with my first lesson. “The development of an analytical integration program for an organization typically includes three elements: plan, design, and build. It also includes the ability to identify customer needs during journey management.” One thing I learned while working in Japan is that the Japanese do not like to ask as many questions or talk as much as folks from the West do. So, I had to pick on them a lot, just to keep them engaged. I eyed a student in the back left corner. “Tadashi, an analytical integration project can be seen as a journey where the consulting project manager is the captain of the ship and the journey manager is the navigator. Aboard the ship, we have users and consultants.” He fumbled with his pencil. I continued, “Another critical concept at the managing phase is the seed. The seed in this scenario would be the detailed map that the navigator will use to guide the ship to each particular destination. The seed defines time, materials, staff, methodologies, and any other required direction. Give me an example of a seed.” Tadashi nervously poked at his cheek with an eraser. “Uh, a seed could be, for instance, an implementation plan for customer retention in a health insurance company.” I was delighted. “Yes! Great example. I brought a seed example from the cruising industry so that you can observe the similarities and differences. [It is included at the end of this chapter.] Journey management is the central process of the managing phase. It focuses on the future. It does not necessarily focus on the big picture. Journey management is like a walk with a decision maker. It is a one-person journey.” I paced the front of the classroom, tapping the dry-erase board with a red marker. “One critical component of journey management is the need to assess and interpret project progress. The consultant needs to talk and talk and push and push the ideas before the project initiation. The consultant should enchant the customer and not stop talking after the project starts.”


I singled out a short-haired woman sitting in the front row. “Why is that, Ayumi?” “The consulting operation needs to foster additional projects to leverage investments and make people aware that the consultant is helping at the customer site,” she said with a smile. I felt optimistic.

PLANNING PHASE

As a consultant, you need first and foremost to be aware of the customer’s strategy, which will typically be either growth or cost cutting. There are two standard ways to win a sumo match: Push your opponent out of the ring or force your opponent to touch the mat with anything but his feet. Then, as I mentioned before, there are other less common ways to lose: If your mawashi drops in front of the crowd, if you don’t show up, anything out of the ordinary, you lose the match. Sometimes I feel like the coach of a sumo wrestler. I feed businesses lots of doughnuts and pizzas if I know their objective is growth, if I know they want to get strong. There is one golden question a consultant should always ask the customer: “Why is this project important for your strategy?” If the consultant does not understand the customer’s strategy, the project scope could change constantly. If we learn the customer’s strategy first, we can cut requirements at the user level and prioritize better. This is the most important concept of the planning stage. The next day I asked my students a follow-up question. “The consultant should understand if a project fits within a certain cost initiative or a mission or growth effort. So, which one of these options constitutes a larger opportunity for success?” I didn’t ask anyone directly and waited for someone to volunteer. A few minutes later, a man in the back raised his hand. “Growth?” he said. “Yes, growth!” I exclaimed. I changed the topic. “However, although current analytical methodologies are good at some things, they are not that good for strictly management purposes. You need to learn how to get projects done within a budget and on a schedule. Do you know what I mean, class?”


I took a moment to think. As senior management, it was my job to educate these junior managers and emphasize that it is always more important to stay within the budget than deliver on time. It is the only way management can run a sustainable operation. I said, “One key risk factor is to move to a new phase of a project without a sign-off.” The class stared back at me blankly, but I forged on. “This action, known as scope creep, will increase the project’s cost without a corresponding increase in budget.” They continued to stare at me. The group was full of anticipation, wanting to know how to be successful. I continued, “The consulting manager should allow time for sign-offs and include a sign-off meeting in the communication plan for the proposal. OK, see you guys tomorrow.” I left the office building, headed for my hotel near the Kabuki Theater. Back in my room, I called room service to order five huge shrimp tempura rolls like a true division one sumo wrestler.

DELIVERY PHASE

I told my class, “A business methodology is a step-by-step work description. It includes project start and project management. It also includes risk management and quality management. Please review the sample project charter for a data warehouse [included at the end of the chapter], and let’s discuss it tomorrow.” A young man raised his pointer finger. “Is the goal of a methodology standardization?” I shook my head. “Methodology is not necessarily a standardization project. A method is just a common process. Methodology is a starting point, not a solution. Methodology is like a religion.” He asked me again, “So, there is always some risk involved?” I nodded. “The project manager creates a work plan using accepted methodologies and starting from the seed. The project manager hedges 80% of the risk this way. He or she always produces a work plan, even for a small project. There will always be a little risk despite your efforts to curb it.” I continued addressing the group: “In summary, the work
plans help you identify potential high-risk factors and the suggested activities to mitigate the risk.” Another young man asked, “Project risk management means that the project manager needs to have a checklist before he or she starts a proposal. The main risk for analytical projects is that the solution stated in the proposal is too poorly defined. Is this because of a vague lack of information? What other risk elements can contribute to failure in a particular project?” My eyes lit up. “Yes. Senior managers like me need to sense danger by involving themselves at the start-up meeting or kickoff meeting for projects involving 10 or more people. Managers can develop a feeling for what is going on this way. It is important to describe lots of assumptions on proposals and thus diminish the risk.” An older woman raised a hand. “Must there be some clarity on the exclusions?” I was not sure exactly what she meant by that. “I believe that to get more business, delivery planning should also include account management. It is very important to get ‘another’ job. The consultant needs to talk to the user executive. At the time of project delivery, the consultant and the journey manager should continue to ‘sell.’ If the project manager does not have this mind-set, when the project is complete, he or she feels the job is done and loses out on the opportunity to create more projects and deliver additional value.” She looked at me through her tortoiseshell glasses and said, “But what about scope creep?” “Basically,” I said, “if you as a project manager accept scope changes, then you should make sure to obtain an equitable commitment from the customer, like money or perhaps something else valuable. You need to enchant your customer. Get it?”1 I often encourage executives to use interviewing techniques to obtain more information, so I told the class, “The kind of information collected from an account is proportional to the ability of the interviewer.” The class agreed that it is important to develop this interviewing skill set within a consulting group. Along with improving interview techniques, success comes from young consultants’ motivation. Someone in the second row raised a hand. “Senior managers need to develop young consultants’ trust, commitment, and abilities. Senior
managers need to focus on listening to young people’s ideas and respect them by always interviewing staff before every assignment. This is important for motivation. Senior managers should also interview staff at the end of a project.” I concurred. “Yes! It is important to set aside time to visit customer projects to talk with young staff. The bottom line in human resources management is that the seed should be exciting. It should be based on people’s excitement, not based on money or easy achievement. Senior management needs to invest and motivate people, change roles frequently, and communicate frequently. Management should be direct in its message. “If you think about it, to make the consulting operation profitable, you need to focus on controlling the controllable.” I told the class to take five since they seemed fidgety. My mind wandered as I gazed down at the sprawling metropolis. I thought to myself, “Management needs to think about how to leverage resources. We must insist on young people doing more senior work because it is the key to a profitable business.” Sumo wrestlers live together in huge compounds where their daily lives are strictly monitored and regulated. I was trying to teach these young recruits to step up and become leaders, but I felt like only a few of them would make the cut. I glanced out the window one more time and headed out of the classroom to grab a cup of coffee from the break room. I bumped into one of the younger guys. He was happy to have found me alone. “Management should look to control the cost structure. Remember, cheaper is better! Management needs to work on adequate billing collections negotiated ahead of the project. If a problem happens, management can cover the payroll. Management should develop seeds and client contacts because if the consulting group does not have a good seed, it will be very difficult to win a project.” He was right. I told him, “Exactly. The actual price comes from a negotiation, and it is not necessarily the billing rate. An ideal price should be above all costs and include a reasonable margin. An execution discount, typically around 8% of total project time, can be included in the price to account for possible project slippage.
total project time. A subcontractor markup should be included in the price. This markup can be approximately 30% to cover the risk taken by the prime contractor. As a manager, you need to inspect test documents.” He was grinning as he exclaimed, “This is a key concept to increase quality! There is a critical need for standardization. Project management methods should help the process. I love standardization.” I smiled as I reached past him for the coffeepot. “Using the project management methodology, you can shorten the time needed as a junior staff member to ultimately become a good engineer or manager. But remember, building a modern and efficient analytical system is a team effort. You should try to help your coworkers out.” I took a deep breath and gave him my best smile.
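The arithmetic behind that advice can be sketched in a few lines. This is only an illustration; the function name, the 20% margin, and the example figures are hypothetical, and the 8% execution buffer and 30% subcontractor markup are simply the rough averages mentioned above.

# Illustrative pricing sketch; all names and figures are hypothetical examples.
def proposed_price(base_cost, margin=0.20, execution_buffer=0.08,
                   subcontractor_cost=0.0, subcontractor_markup=0.30):
    # The price should sit above all costs and include a reasonable margin.
    price = base_cost * (1 + margin)
    # Subcontractor costs are passed through with a markup for the prime contractor's risk.
    price += subcontractor_cost * (1 + subcontractor_markup)
    # A contingency is built in to absorb possible project slippage (the "execution discount").
    return round(price * (1 + execution_buffer), 2)

print(proposed_price(100_000, subcontractor_cost=20_000))  # 157680.0 in this example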

NOTE FILE: SAMPLE SEED PROJECT CHARTER FOR A DATA WAREHOUSE

PROJECT SUMMARY
Customer approached the consultant to take on the role of data warehouse performance (DWP) consultant. Through this role, the consultant will be involved with the DWP and provide assistance with the creation of a DWP road map for the customer project. Customer sees DWP as a key part in its ability to make better and faster decisions during field operations.

Project Objectives
The DWP Consultancy Project will meet these objectives:
■■ Assist with DWP concept description under consideration with regard to the level of integration and real-time operations.
■■ Integration into the team for preparation for operation.
■■ Creative and constructive contribution to all DWP-related activities of the project.
■■ Facilitation, coordination, and quality control of ongoing activities.
■■ Contribution to the development of operations organization under DWP aspects.
■■ Representation of the project in public events and seminars.
■■ Design and equipment of collaboration facilities.
■■ Competence management project (training concept—ready for operations in scope for the DWP work flows and processes).
■■ IT strategy.
■■ Assistance with DWP business case.
■■ Assistance in the structuring of the operations organization considering DWP strategy.
■■ Assistance with development of a plan of further activities and follow-up of the project’s critical performance analysis (CPA) with regard to DWP aspects and consideration of actual goals of all technical disciplines and uncertainty lists/ranking.
■■ Synthesize the DWP business case and the CPA.
■■ Follow-up of public DWP activities.
■■ DWP research projects.

Project Time Frame
The project will cover the period commencing on the begin date and ending on the end date.

Project Critical Success Factors
This project will be successful only if the three critical success elements listed below are made available to the team in a timely manner. It is important to stress that this project is a joint project between the customer steering committee and the consultant team. The consultant’s teams are here to support and assist in the delivery of the abovementioned streams, but ultimately the ownership and responsibility lie with the customer team. Thus the critical success elements are:
■■ Access to documents required
■■ Access to the extended customer and international customer staff
■■ Access to the customer development team

General Project Assumptions
Under each detailed scoping stream, assumptions specific to the business streams have been defined. For the project generally, the following is assumed:
■■ Consultant is given access to relevant personnel as deemed necessary to complete the deliverables (to be agreed during the project period).
■■ Customer participates with resources in necessary scoping/planning activities, which is facilitated by consultant personnel during the project (who and what effort required to be determined in project).
■■ Consultant is given access to updated and relevant documentation from customer on the current situation.
■■ Necessary technical infrastructure is in place.
■■ Consultant is provided with office space with suitable equipment and infrastructure at customer’s premises for at least 1.5 full-time equivalent (FTE) consultants at any time. Requirements for the office space are access to a workstation with all necessary software and Internet access.

In the event of any data needing to be extracted from data sources, customer is responsible for supplying data from data sources and for the quality of these data.

Billing Details: Name, Address, City, State

IT-Related Questions
■■ Will the team be given customer e-mail accounts?
■■ Will the team have remote access via virtual private network?
■■ Will the team spend time on site?
■■ Will the team members be given access to the building through ID cards?
■■ Will the team use the normal working day of 8 hours?

PROJECT SCOPE
Both high-level and detailed scoping documents have been prepared for each of the streams identified. These are supporting documents to this project charter. Consultant would like to remind customer that this is a time and materials contract and that these scopes have been created to get a better understanding of what work needs to be performed and the resources necessary to perform them. These scopes will be used to guide the prioritization and time spent by the consultants contracted to customer.


Scoping Documents
The following items will be included within the scope of the current project:
■■ Customer high-level scopes for DWP consultancy
■■ Customer scope management for facilitation and coordination of the design and equipment of collaboration facilities
■■ Customer scope management for DWP business case
■■ Customer scope management for IT strategy
■■ Customer scope management for DWP knowledge transfer opportunities
■■ Customer scope management for integration of competence management
■■ Customer scope management for verification and alignment of the organizational strategy
■■ Customer scope management for alignment of the business case with CPA recommendations
■■ Customer scope management for advice on the implementation of the CPA

Acceptance Management
The acceptance management process is a series of steps that define the key work products requiring acceptance, define corresponding acceptance criteria, develop mechanisms for demonstrating that requirements have been met, and solicit acceptance. This process is iterative in that it will be repeated until the authorized client representative accepts the final system.

Acceptance Criteria
Detailed within each scoping document are the stipulated acceptance criteria. Each project stream will be closed off by the signing of a project stream closure document.

Project Deliverables
Detailed within each scoping document are the desired deliverables for each project stream.

Project Roles and Responsibilities
Listed are the people directly involved in the project.

Project Team
Below is the proposed team to implement the work. Based on the scoping exercises, time frames, and amount of work required, it is estimated that over the entire length of the project, between 1.5 and 2 FTEs will be required to complete the required work. Based on very high-level estimates, it is foreseen that for the first two months of the project, considerably more hours will be consumed than the 1.5 FTEs and that as much as 2.5 FTEs will need to be used.

Estimated Project Time in Hours

May    June    July    August    Sept    Oct    Nov    Dec    Total
400    400     256     384       260     260    260            2220
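As a rough cross-check of the figures above (assuming about 160 working hours per month per full-time equivalent, an assumption that is not stated in the charter), the monthly hours convert to FTEs as follows:

# Rough cross-check of the hours table; 160 hours/month per FTE is an assumption.
HOURS_PER_FTE_MONTH = 160

monthly_hours = {"May": 400, "June": 400, "July": 256, "August": 384,
                 "Sept": 260, "Oct": 260, "Nov": 260}

for month, hours in monthly_hours.items():
    print(f"{month}: {hours / HOURS_PER_FTE_MONTH:.2f} FTE")

print("Total hours:", sum(monthly_hours.values()))  # 2220, matching the table

At that rate the first two months work out to roughly 2.5 FTEs and the later months to about 1.6, which is consistent with the 1.5 to 2 FTE range quoted above.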

Based on the information received and detailed scoping undertaken, we recommend the following team. These resources will be allocated into the agreed project plan.

Name: Role and Responsibility
Joe Stanley: Project management; facilitation and cooperation of collaborative facility; coordination of DWP knowledge transfer
Dr. Brian Taylor: Review of the content and industry acceptance of the business case; final review of all documentation
Dr. Jay Pulgino: Business case content developer; focus on DWP and integrated network management; content management—solution consistency; deep industry knowledge and business case development experience
Mike Sall: Business case—lead driver; expert with particular focus on DWP and integrated asset management; operational strategy and business transformation; business process development
Kevin Torres: Business case development; strategy development and implementation; performance management; change management
Ricky Stoll: IT strategy; IT architecture; strategic evaluation


Customer Steering Committee
■■ Team leader
■■ Processing and engineering
■■ IT coordinator
■■ Business units coordinators

Other Potential Customer Contributors
In addition to the individuals already mentioned, the following individuals from customer have been identified as potentially being involved in the project in some way: [list all individuals].

PROJECT MANAGEMENT APPROACH
An effective project management methodology will help the consultant deliver the software solution in the fastest, most cost-effective way.

Communication Management
A good project communication procedure is a critical success factor for managing the expectations of the customer and the stakeholders. If they are not kept well informed of the project progress, there is a much greater chance of problems and difficulties arising, due to differing levels of expectation. The intent of communication management is to ensure timely and appropriate generation, collection, dissemination, storage, and disposition of all project-related information. The project communication will be primarily through a portal page. All documentation and notices will be posted on this site. This is the central point of communication for the project.

Project Meetings
Full team project meetings will occur every two weeks at the customer office. Team members are expected to be present at these meetings.

Project Reporting
All team members are expected to complete weekly status reports and ensure that they are mailed to the team by 11 a.m. on Monday.


CHANGE MANAGEMENT
Change is an inevitable part of everyday business, particularly in projects. If not properly planned for, change factors can have a devastating effect on a project’s successful outcome. Change management is the formal process through which changes to the project plan are approved and introduced. A change can be any addition, deletion, or enhancement of the original agreed scope.

Change Management Process
This change management process needs to be followed if changes to the scope are to be accepted:
■■ Review and agree on the criteria for prioritizing the change.
■■ Initiate the change management procedure.

The purpose of this activity is, on receipt of a change request, to validate and prioritize it.

Process or Defer Change
A change may be deferred if the affected project work product is in a future phase for which detailed definition and planning have not been done. It is important to be clear about how priorities are assigned and how the priority affects the speed with which the change is processed through the remaining change management steps.

Perform Impact Analysis
The purpose of the impact analysis is to quantify and evaluate the costs and benefits of the proposed change to the project. Steps include:
■■ Determine analysis strategy.
■■ Define alternatives.
■■ Gather analysis data.
■■ Develop impact for each alternative.
■■ Review analysis with project team.
■■ Gather additional impact data.
■■ Document impact data.
■■ Review with project team.
■■ Make final change decision.

The purpose of this activity is to obtain approval to implement the change or not.

Update Change Log
Enter the disposition of the change into the project change log on the SharePoint site.

Obtain Approval of Relevant Project Documents
Obtain approval of the updated project charter document from the steering committee. Distribute the updated project documents to all interested parties.

CHANGE MANAGEMENT COMMUNICATION PROCEDURE
Consultant will give a written response to change orders within five working days from when consultant receives such orders. The response will contain information on any effects on the price and milestone plan. For the change order to take effect, the customer must accept the consultant’s response in writing within five business days following its receipt. If consultant does not receive a written acceptance from the customer within the time frame specified, the consultant’s written response will be deemed rejected and consultant will have no obligations other than to continue the project as initially agreed between the parties.

RISK MANAGEMENT
The risk management process consists of risk identification, assessment, quantification, response, and controlled execution. By using risk management processes and tools, risks or events that may affect the project, and that both consultant and customer need to be aware of, can be identified as early as possible. Identified risks are evaluated to determine their impact and likelihood of occurrence. Response strategies are established to monitor potential risks and to manage their impact if they occur at any time during the execution of the project.

RISK MANAGEMENT PROCESS
It is fundamental for a successful project to identify risks and the corresponding risk mitigation strategies. The risk management process helps identify high-risk factors, such as situations where the business benefit or the scope of the project


is poorly defined or the project sponsor is not identified or is not very enthusiastic. These situations could be potential problems, and it is critical to tackle them by conducting risk management activities.

Step 1: Identify Risks
Generally, any team member can identify a project risk at any time and in any form during the project life cycle. Every project stakeholder is responsible for actively seeking to surface previously unidentified risks. The project team, client, and subcontractors will regularly participate in both formal and informal risk identification activities. For example, formal risk identification and assessment activities are conducted internally at the time an estimate is prepared and again in each project phase. During project execution, the project team, client, and subcontractor will regularly discuss risk during the recurring project status meetings. Any conversation, formal or informal, has the potential to surface risk.

Step 2: Assess Risks
Once the team identifies a risk and determines whether it is within the control of the development team, the client, or neither, the risk is assessed by evaluating the probability that it will occur and the potential impacts if it does occur. The potential impact of a risk is typically expressed in terms of the consequence to the schedule and/or budget and/or scope. The assessment of risk probability and impact is based on the project team members’ experience, judgment, and understanding of the project.

Step 3: Respond to Risks
Once project risks have been identified and assessed, the project team must decide how to respond to each risk. Responding to risk includes selecting which risks will be responded to; the decisions to accept, mitigate, transfer, or share the risk; and the appropriate planning.

Step 4: Communicate Risks
At every step in the project life cycle, the project team manages risk by identifying, assessing, and responding to it. However, the process can be effective only if the project team discusses risk proactively with the client steering committee.
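To make the assessment in Step 2 concrete, here is a minimal, illustrative sketch; the 1-to-5 scales and the example risks are hypothetical, not taken from the charter.

# Score each risk by probability x impact (1-5 scales); entries are hypothetical.
risks = [
    {"risk": "Project sponsor not identified", "probability": 2, "impact": 5},
    {"risk": "Scope of the project poorly defined", "probability": 4, "impact": 4},
    {"risk": "Key data source delivered late", "probability": 3, "impact": 3},
]

for r in risks:
    r["score"] = r["probability"] * r["impact"]

# The highest-scoring risks get a response (accept, mitigate, transfer, or share) first.
for r in sorted(risks, key=lambda item: item["score"], reverse=True):
    print(f'{r["score"]:>2}  {r["risk"]}')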


ISSUE MANAGEMENT AND ESCALATION PROCEDURES
All projects generate issues, and without a formal issue management process, these issues can quickly escalate. Any situation that remains unresolved beyond an agreed-on period of time will be tracked as an issue. This could be a problem discovered by the technical team, a missed milestone, a failed acceptance test, a change order disposition that cannot be agreed on, and so forth.

Issue Management Process
An issue is a problem that, when not resolved, is certain to have an impact on the project outcome. The impact can relate to scope, schedules, costs, resources, quality, or client. Issue management has four steps: identification, review, escalation, and resolution.

Step 1: Identify the Issue
When an individual is unable to resolve a problem, he or she initiates the issues management process by submitting information about the issue to the project manager. Anyone within the project team, the user community, stakeholders, or contractors can submit an issue. This is to be done in writing, either on paper or in electronic format. The project manager assesses the issue and determines whether surfacing the issue to the project team is the proper course of action. In some cases, the project manager may choose to initiate a change request or address the issue as a risk.

Step 2: Review the Issue
The project manager reports on all open issues at the regular project status meetings. The team will:
■■ Make recommendations for the issue’s resolution.
■■ Estimate the cost and time to resolve the issue.
■■ Identify staff to implement the resolution.
■■ Update the project plan with tasks related to the issue’s resolution.
■■ Escalate the issue to management (if necessary).

The project manager tracks all action items until they are resolved.


Step 3: Escalate the Issue to Management for Review
The team should consider escalating the issue to management only after members have reviewed the issue and have been unable to reach resolution. If the project team elects to request management to participate in the problem-solving process, the project manager notifies the management representation identified in the project charter document. This notification includes a description of the impact and the timeline for resolution. Management representation should support resolution within the timeline or assess an alternative to keep the project moving forward until a permanent resolution can be implemented.

Step 4: Resolve the Issue
When an issue has been resolved, it is closed. The project manager should follow all appropriate processes to update contracts and affected documents.

Escalation
Consultant’s approach to issue management is that the timely resolution of issues is critical to maintaining control of the engagement, achieving the engagement schedule and costs, and maintaining high client satisfaction. The purpose of the escalation process is to ensure that issues and problems are properly managed and resolved in a timely and efficient manner; it is not to place blame. The escalation process provides a mechanism to alert higher levels of management to those issues not being resolved.

NOTE
1. Guy Kawasaki, Enchantment: The Art of Changing Hearts, Minds, and Actions (New York: Portfolio/Penguin, 2011).

CHAPTER 3

Data Acquisition

How data integration experts work to integrate data ready for advanced analysis

[Figure: the predictive analytics life cycle, with data acquisition highlighted for this chapter. Stages shown: problem identification and definition; design and build; data acquisition; exploration and reporting; analysis; actionable analytics; feedback.]

Data preparation typically involves data selection from different sources. Integration from different sources is a critical, time-consuming step. Data are rarely centrally available, summarized, and/or ready to be consumed in advanced analytical tasks. Data integration involves combining data residing in different sources and providing users with a unified view of these data.1 This process becomes significant in a variety of situations. For instance, megaresorts must pull data from their casinos, retail outlets, restaurants, golf courses, hotels, and so on to conduct effective marketing campaigns. A concept that appears as a result of bringing data together from different sources is the need to transform the data in such a way that they are intelligible when put together in a data warehouse, for example. Metadata are data about the data. In simple words, metadata are a dictionary of the data contained in a particular system, complete with definitions and transformations, among other characteristics. A data transformation converts a set of data values from the data format of a source data system into the data format of a destination data system. Data transformation for system needs can be divided into two steps: data mapping and extract, transform, and load (ETL) code generation. In addition, data often are transformed during the statistical modeling stage. Mapping of data element to data element frequently is complicated by complex transformations that require one-to-many and many-to-one transformation rules. The code generation step takes the data element mapping specification and creates an executable program that can be run on a computer system. Code generation also can create transformations in easy-to-maintain computer languages such as Java. The output of these transformations and the associated metadata documenting the process is an analytical master data set suitable for a particular analysis. Data integration challenges appear with increasing frequency as the volume of data and the need to share existing data increase. The subject has become even more important as large volumes of data continue to grow in corporations, governments, and other organizations. This technology environment is referred to as big data.2
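As a minimal sketch of the data mapping and transformation idea, assuming hypothetical field names, rules, and records (none of them from a real system), a mapping specification can drive a small transform step like this:

# A tiny, illustrative data-mapping and transform step: each source field is
# mapped to a target field together with a conversion rule.
source_records = [
    {"cust_name": "smith, jane", "joined": "2013-07-04", "spend_usd": "1,250.00"},
    {"cust_name": "lee, sam",    "joined": "2014-01-15", "spend_usd": "980.50"},
]

# Mapping specification: source field -> (target field, transformation rule).
mapping = {
    "cust_name": ("customer_name", lambda v: v.title()),
    "joined":    ("start_date",    lambda v: v),                      # already ISO format
    "spend_usd": ("annual_spend",  lambda v: float(v.replace(",", ""))),
}

def transform(record):
    # Apply the mapping specification to one source record.
    return {target: rule(record[src]) for src, (target, rule) in mapping.items()}

analytical_master = [transform(r) for r in source_records]
print(analytical_master)

Real ETL tools generate equivalent, and far more elaborate, code from such specifications, and the mapping itself becomes part of the metadata that documents the resulting analytical master data set.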


I flew into London Heathrow airport on a jumbo jet late in the winter. I was there on business, big business, Al Bond style. I slid through immigration and jumped on a train up to the Midlands. I mulled over everything I wanted to speak to my pals about. Data integration involves combining data residing in different sources and providing users with a unified view of these data. Metadata are data about the data: a system component that appeared as a result of bringing data from different sources. There is a distinct need to transform the data in such a way that they are intelligible when put together in a data warehouse. There are different kinds of metadata. Structured metadata make up a dictionary of the data contained in a particular system, complete with definitions and transformations, among other characteristics. This is the metadata most critical to facilitate the discovery of relevant information for predictive analytics. As we pulled into the station, I thought to myself, “What will data transformation be capable of in the future?” Of course, today it converts a set of data values from the data format of a source data system into the data format of a destination data system. Data transformation for system needs can be divided into two steps: data mapping and ETL code generation. Mapping of data element to data element frequently is complicated by complex transformations that require one-to-many and many-to-one transformation rules. Data often undergo additional transformation during the statistical modeling stage. But what more? The train station bustled. I made my way out onto the charming streets and took in a breath of drizzly, cold air. Gloomy gray clouds bunched up in a distant corner of a wet sky. I wandered the cobblestone streets of Shropshire and walked into a restaurant, thinking, “What is really new about big data?” Sure, there is the quadruple V factor: velocity, volume, variety, and value; and, of course, the new dimensions around actionable, adequate, and accurate. But, again, what is truly revolutionary about big data? Glancing at my watch, I drifted off into the wild world of metadata analytics. Suddenly I spotted my old friend Dr. Thomas Bullock across the room. He beckoned me over. “Come have a look at this, Al.” It was as if he had magically read my flashy forward-thinking mind; amid the busy flurry of ATMs, mortgages, retail pricing categories, electronic health


records, monetary exchange, call detail records, and megaresort hotel and casino data lay mathematical genius.

DATA: THE FUEL FOR ANALYTICS
Tom had drawn something on a napkin. His simple diagram spelled out the process of SMF records, system management facilities, a standardized method for writing out records of activity to a file. His pictographic jpeg reminded me of an ancient form of collecting data. These boardroom charts were a twenty-first century version of elite knowledge, the next step on the trajectory laid out by the hieroglyphs of the Egyptians or the quipus of the Inca. I could see the well of knowledge and wisdom bubbling up all around us. I told Tom my bright predictions for the future, “I think the biggest leaps forward for business, science, and society will come from insights gleaned through perpetual real-time analysis of big data.” He smiled. “Yes. With billions of people on the Internet, there is a greater diversity in the forms and shapes data are taking. Transactions of every kind are becoming infinitely richer.” I raised my hands. “Can you believe that 30% of all of the data in the world consists of medical images?” Tom opened his eyes wide. “Wow, that’s incredible! I’m not completely surprised. With the whole planet covered in sensors, some say approximately a billion transistors for every human, more data are being generated than ever before. It is interesting that pictures of our bodies account for so much of it though.” Tom and I work together from time to time as data scientists for industry. He knows as well as I do that while data are growing at an exponential rate, time is not. No company, city, or country, for that matter, can afford to lose the power of big data. I nudged him on. “So, is our goal to remain competitive or to change the world?” Tom gave a sideways smirk. “Well, either way, mate, we have to get analytics at the core of our thinking.” The key to big data is organizing the information in a way that puts it into context and makes it useful. Through smarter data, we can make sense of information in all of its forms. For instance, a commuter


train company can weigh tens of thousands of variables at a time—the rolling stock, the changing weather patterns, who is on board, who is not—and then assemble and schedule thousands of trips. That is not surprising. The extraordinary part of this cycle is that by analyzing all of the goings-on of a company over time, gathering the big data of the whole picture, companies are able to improve operating efficiency significantly, have happier customers, and save millions of dollars. Tom continued, “I mean, obviously, any data point by itself is useless.” I agreed, “Often it is all about context in real time. Capturing connections and making smart systems.” Tom shrugged. “I have been working on some new computational models that stream everything live rather than rely on snapshots of the past. I think the method offers enormous hope in fields like the stock exchange where every second counts.” I interjected, “But the whole field raises important concerns related to privacy and security. What a challenge that is! On the whole, I agree though. The data’s hidden meanings are an incredibly valuable resource that must be harnessed in real time.” We noticed the drizzle letting up a bit and decided to go for a stroll. As we walked the streets of Tom’s small town, I grew increasingly excited for the future. A few hours later we found ourselves seated in a small fish and chips shop, a bit off of the beaten path. “Think of data production, gathering and organizing data, as going hand in hand with data mining. Imagine a pristine mountain of data just ready to get jackhammered and sorted out by us!” Tom enthusiastically cheered. “This is the new frontier! First, we go prospecting. Walk through the jungles of data, attacked by footlong mosquitoes, until we finally discover the gold of our dreams! Visualize the data held in its massive mountainous form,” I exclaimed. “A-sampling we shall go, mate!” Tom continued. “Tell me more about these new big data concepts you’ve been working with, Al.” I told him, “Parallel and distributed computing is not new exactly, but it is constantly being revolutionized. The focus is on comprehensive and theoretically sound treatments of parallel and distributed numerical methods. I focus on algorithms that are naturally suited for


massive parallelization and the fundamental convergence, rate of convergence, communication, and synchronization issues associated with such algorithms.” “What do you think about the NoSQL (not only SQL) revolution?” I said, “It is amazing. Of course, it has downsides as well as advantages. One problem, for instance, is the huge computational cost of executing a JOIN between tables. NoSQL is tossing functionality away for speed, but it sure is fast!”3 Tom chuckled. “I like MapReduce. It is still an insufficiently understood paradigm for large-scale, distributed, data-intensive computation. The variety of MapReduce applications and deployment environments makes it difficult to model MapReduce performance and generalize design improvements.”4 I showed Tom a Hadoop ecosystems graphic I often pondered over. “Ah, yes!” he exclaimed. “Hadoop ecosystems do make managing large unwieldy amounts of data easier. Al, how do you go about teaching corporations about big data and all of these different concepts and tools?” I flashed him a grin. “Well, Tom, in most corporations, the executives are making billion-dollar decisions based on 20% of the data they collect. In other words, most corporations have about 80% unstructured data. I start by helping companies find the technology that will best help them to harness their wild data.” Tom nodded vigorously, his spectacles sliding down his nose. “So, by capturing and analyzing data in an automated manner, companies can make better decisions?” “Exactly,” I replied. “I ask them a few simple questions, like, how does the need to analyze unstructured data relate to big data? How do we turn unstructured data into analytical data? And then they ask me the big question: Where does the business value lie?” Tom nodded enthusiastically again. “Yes, I have a problem right now that I’d like to ask you about. How can a corporation know what is in its massive collection of contracts? In other words, how can it access the important information collectively buried in the data mountain?” “Oh, I can definitely help you get on the road toward creating an analytical database.” I gave him the double thumbs-up. “There are loads of companies with these kinds of problems. For instance, right now I am working with someone making a collection of their undocumented


legacy systems. They need to modernize, and I know the way to automatically document the old legacy systems and create a new database where old legacy applications can be analyzed and searched automatically.” Tom seemed to be following what I was saying. We stepped outside of the shop. The sun was breaking through the clouds a bit, and Tom looked up at the sky wistfully, “I think I’ll go for a run.” “Sounds good. But before you go, why is it that healthcare data are so hard to get? I mean, what can be done about automating the textual data found in the healthcare environment?” Tom began to stretch. “Well, I guess the real question is can we really create an analytical database from raw text?” I threw my hands up. “That is a great question! For example, how can we make sense of slang and shorthand in medical documents? How can we read text in one language and produce analytical output in another language? These are all questions I am working to address right now with my clients.” My eyes glazed over as my mind wandered through a database management system (DBMS) graphic. “Data integration and transformation depend on the data load, right?” I smiled. “Yes. Your database management system handles the requests generated from the structured query language (SQL) interface, producing or modifying data in response to these requests through a multilevel processing system. This is one way of accessing data.” “And what about OLAP?” “Online analytical processing? That’s another way. OLAP helps by creating a layer of summarized information in cubes that can provide quick answers to frequently asked questions. These cubes have been stored in traditional storage like storage area networks (SAN). I am not sure it is going to last a whole lot longer with the advent of in-memory visualization with cheap storage in commodity hardware and standard reporting via the cloud.” “I know you don’t want me to lose the capability for complex calculations, trend analysis, and sophisticated data modeling. Do you?” Indeed I did not. We bid one another farewell and agreed to meet the next time I visited. I hailed a cab and headed back to my hotel on Blueberry Close. I was right on time for tea. I spotted another British friend across the lobby.


Dr. Warren Taylor looked up at me from his milky Earl Grey. “Hello, Al!” I took a seat with him and let the waitress pour me a steaming cup, saying: “I just came from a very exciting meeting with Tom Bullock.” Warren nodded. “Nice to hear you’ve seen Tom. What kind of data were you gentlemen dreaming about?” “Reporting, predictive modeling, real time, and more, Warren!” I replied. “Keeping storage in mind as well, of course, I have been thinking about this quite a bit recently. We have all kinds of new ideas for warehouses that store multidimensional data, whole architectures of different models and schemas.” I had him interested now. His eyebrows lifted to his receding hairline. “Actually, I have been meaning to ask you about data security.” “Definitely! Keep that concern at the forefront, my friend! Our first priorities are always data quality, data governance, and data consistency.” “Yes. I have been hearing a lot about these troublesome toxic data waves crashing into big data power.” He winced. As we sipped our tea, I churned all of these disparate ideas in my head. To obtain a complete picture of the advanced analytics process and how it contributes to business value for scientists like Warren Taylor, I would have to use my years of real-world experience to explain big data. “OK, data acquisition is one of the main tasks involved in predictive analytics, right? And in terms of data quality considerations, there are database utilities and codes that allow programmers to extract data. Remember garbage in, garbage out?” Warren looked up. “And these are utilities developed over time?” “Yes, loaded with rich information about the organization’s operation. Sometimes these metadata are saved in a central location. And typically the extraction code is later reviewed by a professional code reviewer. This prevents the extraction from being misleading. Just as you were wondering, this is a critical step because questionable information threatens the integrity and accuracy of the results.” Warren sighed. “Great, I’m glad to hear it! Of course, the credibility of the data (and the results) is the fundamental basis of a predictive system.” He looked out the window as the clouds began to cover the sun once more. “I have a seed left from my last project that I want to explore.”


I smiled. “That’s great, very important for programmers like yourself. Coding standards enforced in code libraries give programmers the ability to locate previous authors and discuss earlier results easily.” “Very interesting. I have a DQR, a data quality report, that provides basic information on data sets. I often check that to make sure data are being used properly and the selection criteria have been enforced accurately.” Warren winked at me.

A DATA SCIENTIST’S JOB
I winked back. “DQRs also provide information on missing values. Warren, your job as a new data scientist requires a lot of the knowledge and skills of a solutions architect. Your experience in multiple hardware and software environments makes you at home with complex, heterogeneous systems that most people can hardly imagine.” “Well, Al, I don’t know if I would say all of that. I’m no highly seasoned senior technocrat like you.” I gave a hearty chuckle. “We have worked through the software development process together more than a few times!” Warren smiled. “I guess I have experimented with a few systems development life cycles in my day. But a true SDLC scientist like yourself has to work with developers like me to ensure proper implementation. You are the link between me and the needs of the organization.” “You are mistaken, my friend. With my help, I can position you as the bridge between enterprise architects and application architects. Sure, an enterprise architect’s deliverables are more abstract than those of the application or solution architect, but the main distinction lies somewhere else. It is in their motivation. What do you think the other differences are between the two?” Warren rubbed his jaw. “Not quite sure. The enterprise architect is employed to design, plan, and govern strategic rationalization and optimization of an enterprise’s services and components. A solution architect just helps program and project managers to a varying degree.” I agreed with him for the most part. “Depending on the funding model. Where a solution architect starts and stops depends on the process of solution identification and delivery. For example, let’s say an


enterprise may employ a solution architect on a feasibility study. Or a supplier may employ one at bid time before any implementation costs. Or maybe both will employ a solution architect to govern an implementation project. There are so many possibilities!” “What about an information technology services provider?” I snapped my fingers, then pointed at him. “Surely they might employ a solution architect to focus on operational services rather than implementation projects. Any solution where understanding managed operations is important.” Thinking over what I had just said, he replied, “Certainly someone like me could be responsible for coordinating and working on a bid to supply services.” I narrowed my eyes. “Essentially, business planning and general management take ownership of a particular solution offering. We develop and execute a solution strategy and business plan that support growth. You are excellent at shaping and planning service lines. One area you might want to work on is product marketing.” Warren smiled glumly. “What about subject matter expertise?” “Acting as a visionary and strategist is one of my strong suits.” I beamed. “I suggest surveying the market landscape for insights, direction, vendors, and methods first. That will provide the expertise to identify and translate system requirements into software design documentation. Work with technical writers to ensure quality internal and external client-oriented documentation.” “Sometimes I have trouble speaking at trade conferences and feel too shy to seek authorship opportunities in trade publications.” My jaw dropped. “Who, you?” I smiled. “Focus on what I already mentioned then, business development. By helping marketing departments develop marketing material and positioning strategies for the product area, in conjunction with overall marketing message frameworks, you will succeed.” Warren looked optimistic. “This sounds amazing! But what about methodology and quality assurance? How can we integrate all of this together?” “I suggest leading the development of formalized solutions. What do I mean? Building and maintaining seed repositories for deliverables,


methodologies, and business development documents. Interface and coordinate tasks with internal and external technical resources. Collaborate with project managers and directors to provide estimates, develop plans, and implement installation and integration.” Warren ran his hand through his hair, noticeably changed by our conversation. “Al, how can I make myself the person to oversee aspects of the project life cycle from initial kickoff to requirements analysis? The head honcho.” “By designing and implementing projects within the solution area. Providing quality assurance for services within the solution area. Writing in the solution area. You see where I’m going with this!” Warren snickered. “Point taken, mate. So, with your workforce management techniques, I will be able to supervise a team of direct reports who drive service lines into the almighty solution area?” “Yes!” I looked down at my watch. I was out of time. “We will have to continue this later. I have to get going. Cheerio!” I shook his hand, took a last sip of tea, and headed off into a world full of millions of billions of moments and molecules becoming big data.

NOTE FILE: SAMPLE SEED DATA QUALITY PROJECT

When considering a data quality project, professional services will look at the infrastructure and staging areas as well as how the core data content has been structured and designed. Consultants may examine the ad hoc reporting, bulk report distribution, data provisioning to data marts, specialized analytical platforms, and your data warehouse. After the analysis, professional services experts will tell you where you could improve your architecture design, scale up your platforms, and ultimately invest in the future. The main procedures required for the data quality project are:
■■ Data profiling
■■ Entity resolution
■■ Standardizing
■■ Data enrichment using quality knowledge base
■■ Monitoring


Data Profiling
Data profiling is vital in identifying the root cause of poor-quality and disparate data sources. Data profiling is a primary component of the consultants’ strategy to identify duplicate records and rationalize information into one master record. The main procedures for profiling are listed below; a small illustrative sketch follows the list.
■■ Develop a complete assessment of the scope and nature of data quality issues.
■■ Create an inventory of data assets.
■■ Inspect data for errors, inconsistencies, redundancies, and incomplete information.
■■ Build the foundation for future data management initiatives.
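The sketch below is a minimal illustration of the profiling step, with hypothetical column names and records; it surfaces the kinds of issues listed above, namely missing values, duplicate rows, and inconsistent entries.

# Minimal data-profiling sketch: count missing values, distinct values, and
# exact duplicate rows. Column names and rows are hypothetical.
from collections import Counter

rows = [
    {"customer_id": "C001", "state": "NC", "phone": "919-555-0100"},
    {"customer_id": "C002", "state": "nc", "phone": None},
    {"customer_id": "C001", "state": "NC", "phone": "919-555-0100"},  # duplicate
]

for col in rows[0]:
    values = [r[col] for r in rows]
    missing = sum(v is None or v == "" for v in values)
    distinct = Counter(v for v in values if v not in (None, ""))
    print(f"{col}: {missing} missing, {len(distinct)} distinct -> {dict(distinct)}")

exact_duplicates = len(rows) - len({tuple(sorted(r.items())) for r in rows})
print("exact duplicate rows:", exact_duplicates)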

Entity Resolution
Entity resolution measures the degree of similarity between two data elements, often based on weighted matching of records. By defining these unique instances, the information can be assigned to a single, consolidated record or flagged for manual intervention or further processing. The consultants will carry out the following (a small sketch of the matching idea follows the list):
■■ Identify the records across multiple data sources from incomplete and nonobvious relationships.
■■ Implement the entity resolution routines through advanced fuzzy matching technology.
■■ Create multirecord clusters, confidence scores, and scatter plots to determine potential clusters.
■■ Analyze the suitability of data elements as potential identifying attributes.
■■ Recognize when slight variations suggest a connection between records.
■■ Integrate data into a single warehouse or maintain separate sources with relationships maintained in a single master record.
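Here is a minimal sketch of the fuzzy-matching idea, using simple string similarity from the Python standard library. The records, the field weights, and the 0.85 threshold are hypothetical; production entity resolution uses far richer, locale-aware rules.

# Minimal entity-resolution sketch: flag record pairs whose weighted name and
# city similarity exceeds a threshold. Records and threshold are hypothetical.
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

records = [
    {"id": 1, "name": "Acme Manufacturing Ltd", "city": "Birmingham"},
    {"id": 2, "name": "ACME Manufacturing Limited", "city": "Birmingham"},
    {"id": 3, "name": "Bolton Hardware", "city": "Manchester"},
]

for a, b in combinations(records, 2):
    score = 0.7 * similarity(a["name"], b["name"]) + 0.3 * similarity(a["city"], b["city"])
    if score > 0.85:
        print(f"possible match {a['id']} ~ {b['id']} (confidence {score:.2f})")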

Standardizing
The consultant will define the standard of the data to be governed. This is a critical success factor in implementing data governance and in business integration with data quality. There are two main areas for standardization:
1. Standardizing data names and data definitions so that data carry consistent and unambiguous meaning with them everywhere they go.
2. Standardizing data content by managing data quality throughout the enterprise to profile, validate, clean, and complete data content.

DATA ENRICHMENT USING QUALITY KNOWLEDGE BASE
A quality knowledge base (QKB) has a single data classification and transformation engine with thousands of prebuilt data quality rules, grammars, and vocabularies. These rules help to verify and improve the quality of name and address information. A good example of this functionality is DataFlux®. DataFlux QKBs use geographic and language-specific rules and standardization conventions to manage country-specific address data. Each QKB has built-in data quality algorithms for common data types (such as name, address, or phone standards) that can be customized and extended by users. This step of the data quality process will do the following (a simplified standardization sketch follows the list):
■■ Enrich the internal address data.
■■ Create accurate reports and analytics on customers, both internally and for compliance requirements, and add value to data on materials, products, and services.
■■ Substantially reduce undeliverable mail and reduce associated costs.
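Purely as an illustration of the standardization step (this is not DataFlux or a QKB, and the records and rules are hypothetical), a couple of simple rules can already bring name and phone values to a consistent form:

# Illustrative standardization rules; real QKB-driven tools apply thousands of
# prebuilt, locale-aware rules. Records and rules here are hypothetical.
import re

def standardize_name(name):
    # Collapse extra whitespace and apply consistent capitalization.
    return " ".join(name.strip().split()).title()

def standardize_phone(phone, country_code="+1"):
    digits = re.sub(r"\D", "", phone)
    if len(digits) == 10:                        # assume a 10-digit national number
        return f"{country_code}-{digits[0:3]}-{digits[3:6]}-{digits[6:]}"
    return phone                                 # leave anything else for manual review

raw = [
    {"name": "  jane   SMITH ", "phone": "(919) 555 0100"},
    {"name": "sam lee", "phone": "919.555.0199"},
]

clean = [{"name": standardize_name(r["name"]), "phone": standardize_phone(r["phone"])}
         for r in raw]
print(clean)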

Monitoring
Data monitoring technology uses an advanced service-oriented architecture to expose data quality rules as Web services to enable ongoing, accurate information. These rules will be used with the existing framework, providing regular status checks of data governance procedures. The implementation tasks are as follows (a small illustrative sketch follows the list):
■■ Design and enforce rules to determine whether data are maintained within proper control limits and meet predefined business rules.
■■ Create data alerts and controls to verify that data remain in compliance with internal and external data policies.
■■ Allow customers to react to data problems quickly, before inaccurate or invalid data negatively impact the business.
■■ Create customized business rules to validate and audit operational processes.
■■ Enable enterprise governance, risk, and compliance monitoring.
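A minimal sketch of such a monitoring rule follows; the business rules, control limits, and records are hypothetical.

# Minimal data-monitoring sketch: validate records against simple business rules
# and collect alerts. Rules, limits, and records are hypothetical.
rules = {
    "age":        lambda v: v is not None and 0 <= v <= 120,
    "unit_price": lambda v: v is not None and 0 < v < 10_000,
    "country":    lambda v: v in {"US", "GB", "DE", "NO"},
}

records = [
    {"id": 1, "age": 34,  "unit_price": 19.99, "country": "US"},
    {"id": 2, "age": 212, "unit_price": 49.00, "country": "GB"},   # bad age
    {"id": 3, "age": 51,  "unit_price": -5.00, "country": "XX"},   # two violations
]

alerts = [(r["id"], field)
          for r in records
          for field, ok in rules.items()
          if not ok(r.get(field))]

for record_id, field in alerts:
    print(f"ALERT: record {record_id} violates rule for '{field}'")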


NOTE FILE: ABOUT HADOOP

INTRODUCTION
The fuel for analysis is data. In the past, this meant structured data created and stored by organizations themselves, such as customer data housed in customer relationship management applications or transactional data stored in enterprise resource planning systems. However, the volume and type of unstructured and semistructured data now available to enterprises are growing rapidly from the accumulated traditional sources, as well as new sources such as social media and networking services, sensor and networked devices, and machine- and human-generated online transactions. These make up what is today called big data. These new massive data volumes created the need to develop new forms of data storage and the corresponding data management, visualization, and analysis. The early pioneers of big data were the largest Web-based social media companies—Google, Yahoo!, Facebook—and it was the volume, variety, and velocity of data generated by their services that required a radically new solution. Volume is just one key element in defining big data. The other two are variety and velocity. Variety refers to the many different data and file types that are important to manage and analyze more thoroughly, but for which traditional relational databases are poorly suited. Velocity is about the rate of change in the data and how quickly it must be used to create real value.5

WHAT’S DIFFERENT IN DATA MANAGEMENT?
Hadoop is an Apache open-source software framework; as a framework, it includes a number of components for processing, storing, and analyzing massive amounts of distributed, unstructured data. Originally created by Doug Cutting at Yahoo!, Hadoop was inspired by MapReduce, a programming model developed by Google in the early 2000s for indexing the Web. It was designed to handle petabytes and exabytes of data distributed over multiple nodes in parallel. Hadoop clusters run on inexpensive commodity hardware (i.e., blades) so projects can scale out inexpensively. Hadoop is a project of the Apache Software Foundation, where hundreds of contributors continuously improve the core technology. Here is a brief description of five of the top Hadoop components:
1. Hadoop Distributed File System (HDFS™) is the primary storage layer used by Hadoop applications. It is a distributed, scalable, Java-based file system adept at storing large volumes of unstructured data.


2. MapReduce is the compute layer of Hadoop. MapReduce jobs are divided into two parts. The “Map” function divides a query into multiple parts and processes data at the node level. The “Reduce” function aggregates the results of the “Map” function to determine the “answer” to the query.
3. Hive is a Hadoop-based data warehouse developed by Facebook. It allows users to write queries in SQL, which are then converted to MapReduce. This allows SQL programmers with no MapReduce experience to use the warehouse and makes it easier to integrate with business intelligence and visualization tools.
4. Pig is a high-level data flow language (Pig Latin) and compiler developed by Yahoo! that creates MapReduce jobs for analyzing large data files.
5. Apache HBase is a distributed, scalable, nonrelational database that has random, real-time read/write access. The database allows for low-latency, quick lookups in Hadoop. It adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts, and deletions.6

HADOOP: ADOPTION PROS AND CONS
The main benefit of Hadoop is that it allows enterprises to process and analyze large volumes of unstructured and semistructured data, otherwise inaccessible to them, in a cost- and time-effective manner, because Hadoop clusters can scale to petabytes and even exabytes of data. It is also inexpensive to get started with Hadoop. Developers can download the Apache Hadoop distribution software for free and begin experimenting with it in less than a day. The downside to Hadoop and its myriad components is that they are immature and still developing. However, one of the most pressing barriers to adoption for big data in the enterprise is the lack of skills around data science. Many difficulties also arise from organizational misconceptions, moving away from traditional methods, the shortage of programming resources, and the inability to change. Like any other change, big data requires big adjustments. The good news is that some of the brightest minds in IT are contributing to the Apache Hadoop project, and a new generation of Hadoop developers and data scientists are coming of age. As a result, the technology is advancing rapidly, becoming both more powerful and easier to implement and manage.

HOW HADOOP WORKS
A client accesses unstructured and semistructured data from sources including log files, social media feeds, and internal data stores. Hadoop breaks the data up


into parts, which are then loaded into a file system made up of multiple nodes running on commodity hardware (blades). The default file store in Hadoop is HDFS. File systems such as HDFS are adept at storing large volumes of unstructured and semistructured data as they do not require data to be organized into relational rows and columns. Each part is replicated multiple times and loaded into the file system so that if a node fails, another node has a copy of the data contained on the failed node. A Name Node acts as facilitator, communicating back to the client information such as which nodes are available, where in the cluster certain data resides, and which nodes have failed.

Once the data are loaded into the cluster, they are ready to be processed via MapReduce. First, the client submits a “Map” job—usually a query written in Java—to one of the nodes in the cluster known as the Job Tracker. The Job Tracker refers to the Name Node to determine which data it needs to access to complete the job and where in the cluster that data are located. Once determined, the Job Tracker submits the query to the relevant nodes. Rather than bringing all the data back into a central location for processing, processing then occurs at each node simultaneously or in parallel. This is an essential characteristic of Hadoop. When each node has finished processing its given job, it stores the results. The client initiates a “Reduce” job through the Job Tracker, in which results of the map phase stored locally on individual nodes are aggregated to determine the “answer” to the original query and then loaded onto another node in the cluster. The client accesses these results, which can then be loaded into one of a number of analytic environments for analysis. The MapReduce job has now been completed.

The next phase is analysis. Once the MapReduce phase is complete, the processed data are ready for further analysis by data scientists and others with advanced data analytics skills. Data scientists can manipulate and analyze the data using any number of tools for any number of uses, including to search for hidden insights and patterns or to use as the foundation to build user-facing analytic applications. The data can also be modeled and transferred from Hadoop clusters into existing relational databases, data warehouses, and other traditional IT systems for further analysis and/or to support transactional processing.
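To make the map and reduce phases concrete, here is a minimal word-count sketch in plain Python. It only simulates, on a single machine, the map, shuffle, and reduce steps that Hadoop distributes across nodes; it is an illustration, not Hadoop code.

# Simulated MapReduce word count: map emits (word, 1) pairs, the shuffle groups
# them by key, and reduce sums each group. Hadoop runs these same steps in
# parallel across many nodes; here everything runs locally for illustration.
from collections import defaultdict

documents = ["big data needs big storage", "data fuels analytics"]

# Map phase: each "node" emits key-value pairs for its slice of the input.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: aggregate each group to produce the answer.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)   # {'big': 2, 'data': 2, ...}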

D ata A c q uisiti o n  ◂ 49

This exploration can result in insights and innovation that may lead to new products or organizational improvements. Some of the use cases are: ■■

Online recommendation engines. To match and recommend users to one another or to products and services based on analysis of user profile and behavioral data

■■

Sentiment analysis. To determine the user sentiment related to particular companies, brands, or products

■■

Risk analysis. To analyze large volumes of transactional data to determine risk and exposure of financial assets

■■

Fraud detection. To detect fraudulent activity

■■

Marketing campaign analysis. To monitor and determine the effectiveness of marketing campaigns

■■

Customer churn analysis. To identify patterns that indicate which customers are most likely to leave for a competing vendor or service

■■

Social network analysis. To determine which customers pose the most influence over others inside social networks

■■

Customer experience analytics. To integrate data from previously siloed customer interaction channels, such as call centers, online chat, Twitter, and so on, to gain a complete view of the customer experience

■■

Network monitoring. To collect, analyze and display data collected from servers, storage devices, and other IT hardware to allow administrators to monitor network activity and diagnose bottlenecks and other issues

■■

Research and development. To comb through large volumes of text-based research and other historical data to assist in the development of new products

OTHER BIG DATA STORAGE TECHNOLOGIES
Hadoop was derived from Google technology and put into practice by Yahoo! and others. While Hadoop is associated with big data, it is just one of three classes of evolving technologies used to store and manage big data. The other two classes of big data storage solutions besides Hadoop are NoSQL and next-generation data warehousing.

NoSQL means “not only SQL” because these types of data stores offer domain-specific access and query techniques in addition to SQL or SQL-like interfaces. Technologies in this NoSQL category include key value stores, document-oriented databases, graph databases, big table structures, and caching data stores.


The specific native access methods to stored data provide a rich, low-latency approach, typically through a proprietary interface. SQL access has the advantage of familiarity and compatibility with many existing tools.

Unlike traditional data warehouses, next-generation data warehouses are capable of ingesting large amounts of mainly structured data with minimal data modeling required and can scale out to accommodate multiple terabytes and/or petabytes of data. These data warehouses support near-real-time results to complex SQL queries, a very important capability missing in today’s batch-oriented Hadoop. The fundamental characteristics of a next-generation data warehouse include:
■■ Massively parallel processing, or MPP, capabilities. Next-generation data warehouses employ MPP, which allows for the intake, processing, and querying of data on multiple machines simultaneously. The result is significantly faster performance than traditional data warehouses that run on a single, large box and are constrained by a single point for data intake.
■■ Shared-nothing architectures. A shared-nothing architecture ensures there is no single point of failure in next-generation data warehousing environments. Each node operates independently of the others, so if one machine fails, the others keep running. This is particularly important in MPP environments in which sometimes hundreds of machines process data in parallel.
■■ Columnar architectures. Most next-generation data warehouses employ columnar architectures. In columnar environments, only the columns that contain the data necessary to determine the “answer” to a given query are processed, rather than entire rows of data, resulting in fast query results. This also means data do not need to be structured into neat tables as with traditional relational databases. (A small sketch of the columnar idea follows this list.)
■■ Advanced data compression capabilities. These capabilities allow next-generation data warehouses to ingest and store larger volumes of data than otherwise possible and to do so with significantly fewer hardware resources than traditional databases. A warehouse with 10-to-1 compression capabilities, for example, can compress 10 terabytes of data down to 1 terabyte.
■■ Commodity hardware. Like Hadoop clusters, most next-generation data warehouses run on off-the-shelf commodity hardware so they can scale out in a cost-effective manner.


HOW BIG DATA STORAGE APPROACHES WORK TOGETHER

Next-generation data warehouses are not designed to ingest, process, and analyze the semistructured and unstructured data that are responsible for the data volumes in the big data era. Hadoop excels at processing and analyzing large volumes of distributed, unstructured data in batch fashion. Next-generation data warehouses excel at analyzing mainly structured data in near real time. Analysis done in Hadoop can be ported into next-generation data warehouses for further analysis and/or integration with structured data. Thus, to make use of the totality of an enterprise’s data assets, a combination of Hadoop/NoSQL and next-generation data warehouses is often required. There are a number of prebuilt connectors to help Hadoop developers and administrators perform data integration, and a handful of vendors offer big data appliances that bundle Hadoop and next-generation data warehousing with preconfigured hardware for quick deployment with minimal tuning required.

To fully take advantage of big data, however, enterprises must take further steps. They must employ advanced analytics techniques on the processed data to reveal meaningful insights. Data scientists perform this sophisticated work in one of a handful of languages or approaches, including SAS and R. The results of this analysis can then be operationalized via big data applications, either homegrown or off the shelf. Other vendors are developing business intelligence–style applications to allow non–power users to interact with big data.

HOW SAS WORKS WITH HADOOP FOR ANALYTICS

SAS programmers can use the LIBNAME statement, which makes Hive tables look like SAS data sets. In addition, PROC SQL provides the ability to execute explicit HiveQL commands in Hadoop. Finally, several SAS procedures (including PROC FREQ, PROC RANK, PROC REPORT, PROC SORT, PROC SUMMARY, PROC MEANS, and PROC TABULATE) are currently supported to work with Hadoop, with more coming. The SAS/ACCESS® software provides transparent data access to Hadoop (via HiveQL). SAS users can access Hive tables as if they were native SAS data sets and maximize Hadoop’s distributed processing capability. Figure 3.1 depicts the SAS/ACCESS® to Hadoop component among other SAS components. SAS helps execute Hadoop functionality with Base SAS® by enabling MapReduce programming, scripting support, and the execution of HDFS commands from within the SAS environment.


Figure 3.1  SAS/ACCESS® Software. The figure shows SAS clients (SAS Data Integration Studio, SAS® Enterprise Guide®, SAS Web Report Studio, SAS OLAP Cube Studio, or any SAS client) connecting through the SAS/ACCESS interfaces and DBMS client software or utilities (loaders, unloaders, etc.) to data sources including relational databases, data warehouse appliances, nonrelational sources, ERP systems, PC files, and Hadoop. Source: Copyright © 2014 SAS Institute Inc. SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc., Cary, NC, USA.

These abilities complement the capability that SAS/ACCESS® provides for Hive by extending support for Pig, MapReduce, and HDFS commands. SAS administrators can manage Hadoop using SAS® Information Management and SAS tools such as SAS® Enterprise Data Integration Server. Using these tools, administrators have access to SAS metadata, data lineage, and security to manage the Hadoop environment. Users can work in Hadoop via an intuitive graphical interface, SAS Enterprise Guide, and can create data management and analytic code (via Hadoop’s User Defined Sources, or UDS).
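For example, the two access styles described above—the LIBNAME statement and explicit HiveQL pass-through in PROC SQL—might be sketched as follows. This is a minimal sketch only: the server, schema, and table names are placeholder assumptions, and connection options vary by site.

   libname hdplib hadoop server="hivesrv01" port=10000 schema=default user=sasdemo;

   /* Hive tables now look like SAS data sets to any supported procedure */
   proc means data=hdplib.web_clicks n mean;
      var session_length;
   run;

   /* Explicit pass-through: the inner query runs as HiveQL inside Hadoop */
   proc sql;
      connect to hadoop (server="hivesrv01" port=10000 schema=default);
      create table work.click_summary as
      select * from connection to hadoop
         ( select customer_id, count(*) as clicks
           from web_clicks
           group by customer_id );
      disconnect from hadoop;
   quit;

The first style keeps the familiar SAS programming model; the second pushes the heavy lifting down to the cluster and returns only the summarized result.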


A core component of SAS® Visual Analytics, the SAS® LASR™ Analytic Server uses HDFS as local storage at the server for fault tolerance. It is hardware-ready SAS software that combines the scalability of HDFS with the in-memory processing speed of a RAM-intensive clustered blade server environment. The SAS® LASR™ Analytic Server is specifically engineered to address the diverse needs of advanced analytics. It is a read-mostly, stateless, distributed in-memory server designed to solve problems that were previously computationally infeasible.

PROC HADOOP (Base SAS®) can mix and match SAS and Hadoop processing. The SAS/ACCESS® to Hadoop engine is a read-and-write LIBNAME engine that uses Hive to access HDFS. It is a fast loader that uses the Hadoop Streaming application programming interface. It can be used from the SAS DATA step and PROC SQL, and it provides support for SAS procedures. SAS also encapsulates multiple Hadoop operations into a job flow to read HDFS files, write HDFS files, query Hadoop using HiveQL, use MapReduce jars, provide a wrapper for Pig Latin, and transfer external data to and from Hadoop using Hadoop utilities. These capabilities might mitigate the need for Hive programming skills. SAS also provides a metadata layer to supplement Hadoop’s access, security, audit and tracking, and data lineage.

SUMMARY

A current big data environment includes commodity hardware, big data enterprise Hadoop distributions, Hadoop data management components, an analytic layer with analytic application development platforms as well as advanced analytics applications, an application layer for data visualization tools and business intelligence applications, and services to glue everything together (consulting, training, technical support, software and hardware maintenance, and hosting services for big data in the cloud). SAS helps leverage the existing investment in SAS technologies for data management, reporting, and visualization.

NOTES

1. Maurizio Lenzerini, “Data Integration: A Theoretical Perspective,” in Proceedings of the Twenty-first ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 3–5, 2002, Madison, Wisconsin.
2. Tony Fisher, The Data Asset: How Smart Companies Govern Their Data for Business Success (Hoboken, NJ: John Wiley & Sons, 2009).
3. Peter Wayner, “Seven Hard Truths about the NoSQL Revolution,” Infoworld, October 20, 2010. Retrieved from www.infoworld.com/d/data-management/7-hard-truths-about-the-nosql-revolution-197493?page=0,3.
4. Yanpei Chen, Archana Sulochana Ganapathi, Rean Griffith, and Randy H. Katz, “A Methodology for Understanding MapReduce Performance Under Diverse Workloads,” November 9, 2010. Retrieved from www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-135.html.
5. Douglas Laney, “Deja VVVu: Others Claiming Gartner’s Construct for Big Data,” January 14, 2012. http://blogs.gartner.com/doug-laney/deja-vvvue-others-claiming-gartners-volume-velocity-variety-construct-for-big-data.
6. Jeff Kelly, September 16, 2013. http://wikibon.org/wiki/v/Hadoop-NoSQL_Software_and_Services_Market_Forecast_2012-2017.

C H A P T E R  4

Exploration and Reporting

The world of data exploration is ruled by reporting and visualization . . . thoughts on slice and dice

[Chapter-opening diagram: the predictive analytics life cycle—problem identification and definition, data acquisition, exploration and reporting, analysis, design and build, actionable analytics, and feedback—with the exploration and reporting stage highlighted.]

It was 2006, and I was in India. In a daze of humidity and heat, I was wondering, “How can we implement business analytics functionality in a financial enterprise in a country like India?” As I passed underneath a tangled web of electrical wires, I heard my name. “Hello, Al!” A man was following my winding path through a crowd of brightly colored saris, moving effortlessly behind me. I turned around and walked to greet him. I had spent time with him over the phone in the past few weeks. At that moment, I thought to myself, “I’ve built a brand-new data warehouse with lots of data stored neatly inside. What am I going to do with it?” Climbing through the beautiful and ancient city of Delhi, big data are on my mind. I am here in Delhi to observe a colleague of mine, Sanjiv, teach his staff about business intelligence, a.k.a. reporting. He will be leading a class on big data, emphasizing that the first and most important use of new analytical platforms is effective visualization. Alternatively, reporting can be used to support statistical analysis, that is, modeling. I shook Sanjiv’s hand, and after an exchange of pleasantries, we launched into the topic at hand. “Current reporting is definitely not our father’s reporting. I remember waiting in my office at Blue Cross and Blue Shield Association in Washington, D.C., for a huge stack of paper delivered on a hand truck in the 1980s. Today, analysts use in-memory data that delivers interactive graphs on Web portals where business analysts can slice and dice data enriched with images and graphs. They reach out to the enterprise via SMS, videoconferencing, and/or e-mail. They get their information immediately on their mobile devices like iPads and iPhones. What a difference!” Sanjiv smiled knowingly. “Yes. The Internet and mobility have revolutionized the way successful businesses communicate with their customers, vendors, and suppliers. New Web-based platforms help decision makers gain insight from their data, discover what customers want and need, and enhance product offerings for long-term profitability. More important, the distribution of information is much cheaper.” I laughed. “Every day, data managers face the challenge of managing terabytes or petabytes of corporate data. Big data projects are blooming everywhere. Information flows into corporations from


customers, vendors, and suppliers. Managing, accessing, and understanding these data are complex processes. We are drowning in data!” Sanjiv wiped his brow. “I work hard to teach my staff that competitive companies are selecting innovative solutions to leverage their limited resources and to deliver fresh, relevant information and tools for understanding big data as quickly and easily as possible. Cloud technology is helping deliver high-quality data to key decision makers whenever and wherever they need them. This evolving technology is providing decision makers with the information and knowledge they need to understand their customers’ experience, predict customer needs and behavior, and ultimately grow their businesses.” I added, “Definitely! Corporate acceptance of Web-based analytics is also growing. Decision makers are evaluating competitive technologies, corporate objectives, and resources and are deploying analytical solutions for applications from customer relationship management [CRM] to data mining. Cloud CRM analytics have experienced exponential growth in the last few years.” Sanjiv and I get to the hotel lobby. We sigh with relief as we enter the air-conditioned calm and head upstairs to try to find the meeting room. A colleague of Sanjiv’s, Samir, is speaking at the front of the large conference room.

VISUALIZATION

Samir reads the following few lines from a piece of paper he has in his hand: “Understanding Web-Based Analytics.” He takes a sip of water. “Visual analytics can deliver standardized analytic dashboards to end users’ desktops via simple Web browsers or similar technology. Using a Web browser as a ‘viewer’ allows users to access analytical software and create interactive graphics to discover important trends in all of their data loaded in memory. In such an analytic system, users access data from a central repository. Users analyze data specific to their needs and then create attractive graphics that translate the results for associates, from senior executives to field managers.” Figure 4.1 shows a typical SAS® Visual Analytics dashboard. A hand shoots up. “For example, can marketers analyze customer feedback generated from a corporate hotline number? If so, these


Figure 4.1  SAS® Visual Analytics Dashboard  Source: Copyright © 2014 SAS Institute Inc. SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc., Cary, NC, USA.

data could then be analyzed, graphically displayed, and delivered to product development, manufacturing, and senior executives who can make adjustments in real time as appropriate.” Samir responds, “Good point. To answer your question, yes they can. Let me ask another question: what are some of the advantages of visualization then?” A student replied, “A lot of people are visual. Don’t forget that a picture is worth a thousand words. Visualization offers decision makers several advantages including real-time access to data and interactive slice and dice as well as drill-down graphic capabilities.” I also raised my hand. “Furthermore, new visualization and reporting technologies help companies implement best practices and facilitate information sharing and collaboration among decision makers at low price points without compromising scalability. New reporting technology advances promise greater interactivity and flexibility, facilitating the inevitable transition from data acquisition to in-depth analysis.” Samir lit up. “Oh hello, Al. I didn’t see you back there. Yes, today data managers need to help cleanse and manage an enormous amount of data to feed reporting and visualization systems that deliver an initial set of information to corporate decision makers.”


Everyone looked back at us, and Sanjiv continued, “Yes. In many instances, a cloud computing platform offers a good opportunity for systems integration and secured multilevel access to critical information.” Samir introduces us. “These are two big data masters, Al and Sanjiv. They taught me everything I know.” I waved. “Don’t forget that users can access and visualize several databases from a centralized location or analyze sections of data, if required.” I scanned the crowd of nodding heads. Sanjiv continued, “Believe it or not, previously, analytical tools and research databases were delivered to users on a customized CD-ROM.” Sanjiv chuckled. “The new visualization and reporting technology provides decision makers with real-time data sometimes stored in memory—it has become increasingly cheap—and associated with powerful analytical tools to deliver fresh, relevant results. These tools allow the detection of patterns immediately.” Sanjiv went on, “Yes. Another significant advantage is providing a mechanism for secure delivery of data. In the past, data managers were reluctant to provide users with sensitive data in a CD-ROM format because of obvious security risks. Now data managers can protect sensitive data while allowing key decision makers access to sections of the database from a password-protected, centralized server using security protocols. Other advantages include version control and efficient processing of changes to decision makers’ analytical and data acquisition tools. This contributes to data consistency across the organization and also fosters collaboration.” Samir brought the class to an end, and I suggested to Sanjiv and Samir that we all go up on the rooftop. I could not help talking business with two of India’s brightest minds. “With Web-based applications, users can access current data to make decisions at e-speed. Increased velocity in the decision-making process enables decisions to be made more quickly and corporations to become more responsive to consumer needs. That helps organizations cope with the velocity, variety, and volume of new data arriving.” Sanjiv continued my thought. “Faster, more accurate decision making can help our companies meet emerging customer needs or avoid costly mistakes. For example, real-time data fed from manufacturing


plants can help plant floor managers make production adjustments, ensuring that high-quality products are delivered to consumers. This is invaluable here in India.” Samir added, “Our senior executives can gain immediate insight regarding emerging trends or anticipate potential problems with their products and services, thus becoming more responsive to changing market conditions.” I agreed. “New, interactive technologies, such as visual analytics using in-memory capabilities, are providing decision makers with more information at lower costs.” Samir smiled and asked, “How is visualization technology moving graphic presentation from a static appearance to an interactive format?” I answered slowly, “One very common way is by using Java applications, delivered via a Web browser. These applications can be programmed to interoperate with other applications and allow the user to drill down into the data or to create hyperlinked graphics to relevant documents, giving decision makers more information about each data point. This discovery process becomes faster. In reality, the visual analytics systems are question generation systems rather than answering systems. Of course, some of the questions become actionable items immediately. Otherwise it would be pure academics.” I went on, “People collaborate because background information or comments can be attached to each data point in a chart. For example, an analyst can view news releases attached to points within a time-series chart, or more detailed regional information could be attached to data points in a map. Also, you can open a visual analytics dashboard with your peers on a Webex session and get a lot of work done.” I drew on a piece of paper a model of actionable data discovery (see Figure 4.2). “This model represents the data integration needed to assemble data. It is sort of a data factory. From the factory, we extract data for a single purpose: to answer a research question. A researcher, or domain expert, may take a look at the selected data and discover some interesting trends or events. At that moment, there are two main possible tracks: one, some discoveries will merit immediate action; or two, some discoveries will pose new questions. New questions may need additional data, and the cycle goes on and on. That’s what this graph shows, the iterative nature of data-based discovery.”


Figure 4.2  Model of Actionable Data Discovery (data discovery leads either to immediate action or to new questions that require new data, and the cycle repeats)

Sanjiv asked, “So, could new visualization tools that are rendering graphics and relevant information on the fly allow analysts and decision makers to visualize real-time data changes instantly?” “Of course, if that’s what is needed,” I replied. “System integrators and data managers are constantly challenged to build solutions that are timely, flexible, and scalable to meet the needs of growing organizations. Delivering graphs over the Web is the perfect solution. A picture is worth a thousand words.” Samir added, “The days of fitting individual PCs with the latest software are gone. It is much cheaper and more efficient to deliver technology based on a simple Web browser, which makes it easy for IT departments to install and maintain the application and for decision makers to access the tools they need to gain insight from their data. In addition, as new employees are hired, the system has the flexibility to add new users easily and training issues are minimal.”

CLOUD REPORTING

I was late for my morning class. When I arrived, Sanjiv was saying, “Cloud reporting offers an opportunity to customize visualization by department or region. As a company grows, reporting can be customized to meet new organizational or directional needs. For example, if a packaged-goods company adds new plants in new countries, the reporting system can be updated within minutes.” A student asked, “As companies expand or diversify, can additional servers be added to meet new performance and scalability demands?”


Sanjiv replied, “Yes, and cloud delivery creates a familiar environment, making it easy to train new employees. Delivering reports via the cloud saves time and money for IT managers who don’t need to update individual software packages on each employee’s computer.” Another student asked, “How does the distribution of reports from a central server allow business intelligence administrators to deliver a single truth by department, geographic location, or need?” Sanjiv smiled, “This deployment method makes standardization easy for companies interested in maintaining best practices. The reporting software could ensure that high-quality products are developed and varying standards are enforced. Administrators can provide and receive feedback in the system from decision makers and make changes to the visualization software as necessary. Since changes can be deployed enterprise-wide within minutes, the BI administrator can be sure that everyone is using the same application.” The student murmured, “Wow. Cloud reporting is fast and economical, and it can help decision makers explore business results seamlessly across borders.” “Yes. For example, retail bankers can collect data and analyze results globally using a prescribed set of standards. Standardization leads to improved communication and comparison of results. Modern visualization tools help communicate these standard results to senior executives and research team members globally. Advanced visualization technology provides the ability to drill down for additional information immediately. OK, let’s break for lunch!” Sanjiv kept talking after class was done. “Al, with cloud reporting, decision makers have access to greater processing power because their scalable environment with several tools can perform analysis on the server. Greater server capacity offers decision makers greater access to advanced reporting and basic predictive analytics, which can then be distributed using the Web.” We jumped into a cab outside. Sanjiv was still talking. “As needs change, administrators can add or remove software tools from the system to provide additional functionalities and increase data storage. Greater processing and access to more powerful analytics can lead to a more in-depth understanding of the data, creating greater predictive value and better decision making. Collaboration and information


sharing are skyrocketing. Attractive, well-designed, multidimensional graphics and dashboards combined with interactive technology will improve insight and information sharing.” He continued: “Sharing results from the visualization and reporting systems to personal productivity desktop packages is not difficult anymore, and sharing results within the enterprise or with international associates is simpler using Internet-based applications.” Sanjiv started to slow down. “Cloud reporting and visualization software components are providing decision makers with greater access to data to gain insight into their customers’ buying patterns and other critical customer insight.”

NOTE FILE: SAMPLE SEED CRUISE GUEST LOYALTY

An effective cruise guest loyalty solution will create an instant, interactive, and effective dialogue between the cruise line staff and their guests. The solution will break down communication barriers in the guest–staff relationship. It will also save employee time while reducing costs and more effectively managing patron issues, all important aspects in enhancing the overall guest experience. This note file describes the use of advanced analytics on databases, SMS messaging, and kiosk technology to accomplish these goals. A loyalty solution will help cruise ship management in the following ways:

SMS and Kiosk Communications

■■

Extend cruise offers to passengers for onboard and land activities.

■■

Track passenger satisfaction with offers.

■■

Track customer communications (marketing campaigns).

Analytics Software ■■

Accurately measure and predict cruise patron profit and loss.

■■

Maximize cruise patron profitability and overall revenue.

■■

Report and analyze key performance metrics in one centralized location (total revenue per top patrons, trips, stays), and analyze revenue for key geographic markets.

■■

Target more specific audiences.


This solution will deliver enterprise data warehouses (EDWHs), customer relationship management, business intelligence, predictive modeling, and mobile solutions. These technologies define unique IT strategies, drive systems selections, target business processes, and facilitate IT infrastructure review. This solution will include an analytic platform that integrates individual technology components into a single, unified system. The result is an information flow that transcends organizational and data silos, diverse computing platforms, and niche tools while also delivering new insights to drive increased value for a cruise organization. The solution will enable an improved customer experience because it allows cruise ship management to establish a framework to capture and intelligently manage a cruise’s collected customer data. It also provides access to companywide patron information, which allows users to generate strategic marketing campaign initiatives that improve the utilization of information in these ways: ■■

Consolidate disparate information into a comprehensive cruise patron view

■■

Store all information pertaining to a campaign in one central location

■■

Create a centralized data repository of patron information from such disparate data sources as casino management systems, cruise management systems, point of sale, retail, and other gaming and hospitality systems

■■

Consolidate information from land activities and cruise passengers

The BI platform will include a Web-enabled executive dashboard, interactive reporting based on multidimensional cubes, static reporting, as well as desktop functionality. A critical step in this process is the master data management (MDM) implementation where data from the source systems are cleansed prior to load in the EDWH. An example that may occur in the MDM step is matching cruise patrons to events, retailers, and hotel and restaurant guests. In many hospitality operations, a unique player account number is used to identify a player on the casino floor. Unfortunately, internal business processes and/or limitations of operational source systems do not allow the ability to capture this unique player account number at hotel check-in or at point of sale. Kiosks will encourage passengers to join the loyalty club. There is a need in the cruising industry for campaign management functionality delivered on a single platform. An appropriate solution should bring mobility, EDWH, analysis, and BI functionality to the cruise enterprise.
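As an illustration of the MDM matching step described above, here is a hedged sketch that links hotel check-in records to casino player accounts on date of birth plus a fuzzy last-name comparison, since no shared account number is captured. The table names, variable names, and the edit-distance threshold are assumptions for illustration, not part of the proposal.

   proc sql;
      create table candidate_matches as
      select c.player_id,
             h.folio_id,
             compged(upcase(c.last_name), upcase(h.last_name)) as name_dist
      from casino_players as c, hotel_checkins as h
      where c.birth_date = h.birth_date             /* block on date of birth  */
        and calculated name_dist <= 100;            /* keep close name matches */
   quit;

Pairs that survive such a screen would still be reviewed before being loaded into the EDWH, which is the usual practice in an MDM step.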


With regard to functionality, a cruise loyalty solution is driven by the points identified next.

Main Features

■■

Quickly and easily identify all passengers onboard and view their statistics.

■■

Track communications with cruise passengers to ensure a constant engagement with them and the cruise brand.

■■

Identify noncarded passengers, and sign them up for the loyalty club directly from a mobile device.

■■

View passengers on special dates from a mobile device and desktop.

■■

Track communications and events for all passengers on the cruise.

■■

Track host performance.

■■

Collect and aggregate all cruise passenger data on property into a single view.

■■

Provide view of cruise passenger total spend and total worth.

■■

Enhance data model with better data sets, allowing for more detailed segmentation and better offer generation.

The implementation team will leverage the extract-transform-load (ETL) experience and deep knowledge of the media and entertainment industry’s operational source systems. Along a parallel path, the implementation team will leverage the knowledge of large, horizontal software solutions for direct marketing, campaign management, and customer contact to implement easy-to-use applications that will provide access to the data contained in the loyalty data warehouse. Functionality Included in the Platform ■■

Loyalty data warehouse engine

■■

.net architecture

■■

Advanced analytics

■■

Business intelligence

■■

Multilanguage capabilities

■■

Software support and implementation

■■

Mobile reporting

■■

Interactive kiosks


IMPORTANCE OF GUEST RELATIONSHIPS ON A CRUISE The most important thing to remember is that everybody on the cruise staff has the same common goal: to provide an outstanding guest experience and therefore obtain satisfied returns. Strong links between cruise line staff and their guests will help facilitate this goal. The guest–staff relationship can be complex, and it crosses many different areas, including guest relationships with the front desk, with the restaurant staff, with other guests, and within the cruise line community in general. Two-Way Communication Interactions between the cruise line staff and guests should be continuous and ongoing. The most valuable interactions are those that travel in both directions. This two-way dialogue makes sense: Both the staff and their guests are concerned about the guest’s positive experience; therefore, both parties will be working together to enhance this common goal. When incorporating a two-way dialogue into the cruise experience, it is important to remember that not all guests will want to be communicated to in the same manner. All methods of communication play a role, whether that communication is face to face, over the phone, using a kiosk, text messages, or e-mail. Each method has its place. Guests might need the cruise line to be a little bit more accommodating when it comes to communication. Barriers to Effective Communications A number of issues need to be addressed so a guest–staff partnership can be formed. Some common obstacles to a better partnership include: ■■

Demographic differences. The guest and staff may come from different culture or socioeconomic backgrounds or perhaps speak different languages.

■■

Role differences. Cruise line guests might have differing expectations of the cruise line staff’s role.

■■

Types of experiences. Prior cruise line experiences may have set up differing expectations.

■■

Miscommunication. Guests and/or staff may lack the ability to identify and communicate key experiences, ideas, or issues.

■■

Communication discomfort. Guests or staff may be uncomfortable about communicating their needs.

■■

Need to feel valued. Staff and guests may perceive that their perspectives and opinions are not valued.


■■

Follow-through. Lack of intelligent and effective information systems hinder the ability to track customer preferences.

How Technology Can Help Improve Communications There is tremendous penetration of cell phones in most countries, continuous advances on touch-screen technology and widespread use and growth of SMS/ text messaging. Effective solutions will leverage these trends to provide communication efficiencies to the cruise line’s guest population. Business Requirements The solution should be simple and user friendly. The solution should not add to the administrative requirements of the cruise line staff. Rather, it will make their lives easier by reducing communication time yet increasing communication interactions between the staff and their guests. The technology will apply to the vast majority of the cruise line’s guests. It does not make sense to only use e-mail in an environment where the guests do not have access to the Internet or do not readily check their e-mail. The solution will provide interactive and immediate communication. The solution will manage communications and keep an audit trail using database technology to ensure appropriate capture of passenger behavior and level of satisfaction with land and onboard activities. The solution will be integrated into the ship infrastructural constraints. Most of all, in today’s world, it should be mobile. The solution will open a communication link between the cruise line staff and their guests. This channel effectively reduces stress while enabling a daily interaction between staff and guests and also increases the social media communications among the guest population. This direct, many-way dialogue can, it is hoped, recognize potential problems and solve them before they spiral out of control. ■■

The solution should capture customer satisfaction regarding activities to close the loop and understand passenger satisfaction.

By providing the cruise line staff with the ability to manage communication regarding passenger activities using a system, the following results can be achieved: ■■

Communication time will be drastically reduced.

■■

The interaction with cruise passengers will be simple and to the point, enabling the staff to inform the passengers of any specific resort activities as well as when and where these activities will occur.


■■

The cruise line staff will have immediate and direct access to the cruise line’s computer system. The key here is that when a staff member deals with an issue, he or she can solve it within seconds due to instant access to guests’ historical information.

Data Mart Details

The system will use a data warehouse structure designed specifically for the unique business needs of the cruising industry. This data warehouse structure will include data points from all the source systems identified.

PROPOSAL REQUIREMENTS

Standard project management and communication strategies commonly found in a project of this nature also will be utilized with this project. Specific details of these strategies will be identified as part of the project plan mentioned above.

Core Resources

A project of this nature typically involves participation of the following team members. The location of each resource may vary.

■■

Project manager

■■

Principal analytical consultant

■■

SAS architect

■■

BI software application consultant(s)

Costs

The consulting team estimates this project to be approximately $$$ to build a prototype. This price estimate is based on previous projects and assumptions made as a result of our discussions. Additionally, the price does not include hardware costs and travel expenses. The first step is a four-week business requirement definition, which costs $$ and produces the project definition document. This is the essential first step for a successful development and implementation of a loyalty system in an iterative fashion.

C H A P T E R  5

Modeling

How analysts don’t just look at the present . . . but peek into the future

[Chapter-opening diagram: the predictive analytics life cycle—problem identification and definition, data acquisition, exploration and reporting, analysis, design and build, actionable analytics, and feedback—with the analysis stage highlighted.]

The year was 1998. I was just about to start a new analytics job on Paulista Avenue in Sao Paulo, the main drag, a fancy boulevard nestled in between megamalls and soccer stadiums. My first day, I walked in and warmed up the crowd. I made business casual small talk with my new staff. “Good morning. How are you? By the way, one way to identify which model to develop and use is to ask business questions.” I’ve always liked to cut straight to the point. Ultimately advanced analytical systems are designed to create sound statistical models. These models are designed using standard research methods. The methods help answer critical business questions with extreme value relevance. I rounded up a Brazilian health insurance team around 11 a.m. and jumped right in. “Case management and clinicians will want to identify high-risk (high-cost) patients, identify the best intervention, determine which patients are likely to have a better outcome with an intervention, identify the factors that influence quality of care measures, predict likely diagnosis or next care need, and streamline patient assessments. Which one of these questions should we tackle first?” Sensing less excitement than I anticipated, I continued, “We can continue identifying potential targets: Underwriting and actuarial staff will want more effective price plans, predictions of patient costs based on diagnosis, medications, and test results. They will also want to identify likely fraud, waste, and abuse; edit checks and scorecarding; detect systemic abuse and billings for unnecessary charges; and create physician and provider profiles.” I continued, “Furthermore, the collections department will want to identify which patients are likely not to pay, identify the best action to take to get payment, and define the best strategy to contact them.” One woman asked, “Would marketing want to conduct market segmentation to understand the healthcare needs of patient populations and define programs to meet these needs?” I responded, “Yes. Marketing should be very interested in retention. They would want to identify likely churners, to help the tasks of new customer acquisition and product up-sell. They will want to learn which subscribers are good prospects for a new program. It is critical for marketing to collect customer feedback so that they understand


customer needs and gain feedback on current products and services for new product development.” The woman raised her hand again. “Why are executives interested in key performance indicators?” I explained, “They want to understand the situation and have reporting that can easily extend into their PDA/laptop/mobile devices. An important interest from executives is metrics development whereby they identify what to measure in a new program.” She kept talking. “But why does everyone want to analyze the data in a different way?” I explained, “I don’t entirely know. People approach solving problems differently, but I do know that effectively utilizing flexible, fast modeling software is a key component on how healthcare enterprises can maximize their existing analytical assets and become more profitable and efficient.” I continued, “Two typical questions for analytical applications in healthcare are: (1) What is a subscriber churn prediction model? and (2) What is a risk adjuster scoring model? “Let’s answer these two questions to illustrate how we go about developing different types of workable real-life analytical models. This way I can clearly describe what goes into the creation of a successful predictive model.” I tried to write everything on the dry erase board. Squeaking the almost dried-up black marker against the almost-white streaky surface, creating a haze of grays, I read aloud as I wrote some tasks to do.

CHURN MODEL

On the board, I wrote: “Subscriber Churn Prediction and Retention Model Definitions” and the next list:

■■

Model deployment

■■

Churn prevention

■■

Subscriber segmentation

■■

Predicted subscriber churn reason analysis

■■

Campaign linkage


■■

Segment profile reports of subscriber churn rate

■■

Subscriber profiling

■■

Action plan for churn prevention

■■

Report writing

■■

Campaign assessment

■■

System administration

I said, “Some logistic elements to consider in our problem are . . .” and wrote down the next list. ■■

Where is the data needed?

■■

How to access the data?

■■

Once we have the data, how are we going to conduct model development?

I stared out into a room of eager students who were taking notes. There was a strange noise running through the dry office air, and the smell of Xerox copies was of no comfort to me. I waited for everyone to finish copying my list. “Think of the way concepts pop into analytics,” I said. “For instance, ‘churn’ is a human behavior usually associated with dropping a particular offer.” A man in the front added, “In the banking industry, churn is called customer retention. Churn may have some cultural components: In Europe, 35% to 50% of the population churns annually. Churning is a global phenomenon.” I replied, “Yes. Do you realize that new customer acquisition costs in telecommunications are up to $70 per individual? There are several reasons for churn: financial and nonfinancial reasons.” The man continued, “Is the objective of a customer retention model to predict customers who are likely to leave before they actually leave?” I replied, “Yes. You, as the analyst, would like to understand the expectation of customers and why they choose to leave. This understanding is then used to design and execute action aimed at keeping high-value customers.” I jotted down some more questions to address on the dry erase board: ■■

Which customers are likely to leave?

■■

Is it a customer canceling a product?


■■

Is it a customer leaving a product inactive?

■■

How long does the product have to be inactive to be considered churn?

■■

Is it considered churn when a customer switches from one product to another within the same company?

A woman tried to make a joke. “Churn, baby, churn!” The rest of the staff and I laughed politely. I continued, “Propensity to churn is the probability that a customer will leave a certain offer in a certain period of time. Potentially everybody is a churner. There are several approaches to model churn.” The same woman asked, “What needs to be done to prepare data?” I answered, “There is a commonly used methodology called SEMMA, which includes sampling, exploring, modifying, modeling, and assessing.1 For instance, to have valid results, you need to conduct data partition: training, validation, and test. The raw data elements must be defined to determine business problems and what you want to predict.” “Do I have to define historical data that contain the known value of what I want to predict?” the woman asked. I said, “Yes. Define additional variables about the customer demographics and transactions using RFM analysis. Include fresh data for which you want to make predictions. The data should contain the same variables as the historical data except for the target. Let’s continue tomorrow.” The next day, I continued, “Let’s say that we spend a great deal of time fixing errors, outliers, and missing values. A software tool like SAS® Enterprise Miner™ (SAS EM) helps in many ways to handle these issues. SAS EM uses methods for imputation and replacement. Let’s not forget that values may be missing for a reason. Of course, in addition to these transformations, SAS EM also provides the ability to create easily logistic regression models, decision trees, and neural networks.”2 A fellow with an orange tie asked, “SAS EM includes a rule-based method for inputting missing values that is based on the decision tree algorithm. Can you also transform variables, automatic power transformations, and automatic binning with respect to a binary target variable?” I replied, “Yes, you can apply transformations to static data such as customer demographics (age, income), contractual data, technical


quality data (network reliability), and administrative transactions data (claims data). These are considered the easiest-to-use types of data for determining customer behavior. I strongly advise you use the latter. These data used to be the only data available. Don’t forget that today 80% to 90% of big data are unstructured—that is, call center records, social media logs, and so on.” Sample data arrangements may include: ■■

Customer-centric, where you have one record per customer

■■

Roll-up of records using one or more statistic (sum, average, min, max), where you end up with even more variables
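A minimal sketch of such a roll-up, assuming a claims transaction table with one row per claim (the data set and variable names are illustrative assumptions only):

   proc sql;
      create table customer_rollup as
      select customer_id,
             count(*)          as n_claims,
             sum(paid_amount)  as total_paid,
             avg(paid_amount)  as avg_paid,
             max(service_date) as last_service format=date9.
      from claims
      group by customer_id;        /* one record per customer */
   quit;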

“Make sure that you eliminate data of customers who churned. When you are selecting the right data, consider the time frame of churn (week, month, quarter) since data must match the same time frame.” A woman said, “Should zero be the last point of your time series?” I was impressed by her knowledge. “Yes. One can also be used as business knowledge to derive new fields. For instance, it could be used as the tenure of a customer or the time until the customer’s contract expires. Differences, moving averages, and trends are other commonly used transformations. There is the concept of population drift. Consider the elapsed time while you prepare the model.” I told the students that you can also define ratios like: ■■

Daytime versus nighttime

■■

Charges versus income

■■

Campaign time versus peak time
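In code, such ratios are simple derived fields. A hedged sketch, assuming the rolled-up table from the earlier example already carries the needed usage and charge variables:

   data model_input;
      set customer_rollup;
      /* guard against division by zero; variable names are assumptions */
      if night_minutes > 0 then day_night_ratio     = day_minutes    / night_minutes;
      if income        > 0 then charge_income_ratio = total_charges  / income;
      if peak_usage    > 0 then campaign_peak_ratio = campaign_usage / peak_usage;
   run;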

“Beware the ‘Let’s model’ syndrome—when you get the model first and then get the data afterward. The most basic models used are logistic regression, decision trees, and neural networks. You may want to use regression analysis or decision trees because they are easier to understand and explain to others. Traditionally the cutoff probability levels used are a higher cutoff of .5 and a lower cutoff of .1.” A man in the audience asked, “A very common method of model development is based on regression analysis. This is done using higher-order terms. The curse of dimensionality: Higher interactions are not normally included by default in SAS EM. Al, how do we find out which interactions are important using stepwise regression?”3


I continued, “You see, we strive to find the most accurate model with the least number of variables. This is called the more parsimonious model. A good model will enable you to choose from a database the top 20% of the people who have the greatest probability of leaving.” The man continued, “I think another method you can use is a decision tree. Basically it is a partitioning of the data into subsets based on categories of input variables. I use it when I want to find rules that can be generalized to a larger, new data set. Am I correct?” This method is also easy to explain to nontechnical people. I flashed him a brilliant smile. “Yes! The target variable can be nominal, interval, or binomial. The decision tree looks like a tree! The variable for the first split is the most influential one. Several methods are used to identify this variable. Decision trees don’t have to use all the variables. These models may be easy, but they could also be quite complex. Researchers pursue an increase in the purity of a node for both target events.” The man was excited. “I have noticed that the cultivation of trees requires addressing the following main questions: Which splits are to be considered? Which split is best? When should the splitting stop? Should some branches be lopped off?” “Those are good questions. Possible splits to consider are nominal inputs or normal inputs. Nominal inputs usually go faster. By default, the split search strategy uses samples within a node. If the node size is larger than 5,000, binary splits are used.” The next morning I tried to pound information into their eager analyst brains. “Let’s continue discussing decision trees. The splitting criteria are based on impurity reduction. Three indicators commonly used are: the Gini index, probability value, and entropy. The Gini index is also used for the two-class problem. Small Gini means a good classification. A pure node has a Gini index of zero. For example, a node with 90% non-churners and 10% churners has a Gini index of 1 – (0.9 × 0.9 + 0.1 × 0.1) = 0.18. In addition, the probability value is often used. The delta change of the impurity should be statistically significant. A chi-squared test is done to check if the observed frequency of the event is significant at a 90% probability from the expected. What is considered the most important split?” One analyst shouted out, “The highest logworth! Are you forgetting entropy? I believe entropy is more of an artificial intelligence way to look at this problem.”
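To make the splitting arithmetic concrete, here is a small worked sketch in a SAS DATA step. The counts are hypothetical: a parent node of 1,000 customers with 200 churners, and a candidate split into a left node of 400 customers (50 churners) and a right node of 600 customers (150 churners).

   data split_check;
      /* parent node: 1,000 customers, 200 churners */
      p_parent    = 200/1000;
      gini_parent = 1 - (p_parent**2 + (1 - p_parent)**2);                    /* 0.32    */

      /* candidate split: left 400 obs with 50 churners, right 600 with 150 */
      p_left     = 50/400;   gini_left  = 1 - (p_left**2  + (1 - p_left)**2);  /* 0.21875 */
      p_right    = 150/600;  gini_right = 1 - (p_right**2 + (1 - p_right)**2); /* 0.375   */
      gini_split = 0.4*gini_left + 0.6*gini_right;                             /* 0.3125  */
      gini_gain  = gini_parent - gini_split;                                   /* 0.0075  */

      /* chi-squared test of churn rate by side of the split (1 df) */
      chisq    = (50-80)**2/80 + (350-320)**2/320
               + (150-120)**2/120 + (450-480)**2/480;                          /* about 23.4 */
      logworth = -log10(1 - probchi(chisq, 1));
      put gini_gain= chisq= logworth=;
   run;

The logworth reported for a candidate split is essentially the negative log10 of the (adjusted) p-value of such a test, so larger values indicate stronger splits.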


I winked at him in solidarity. “Let’s move on. Another splitting method is regression trees. The splitting criterion is the variance reduction. It is based on the F statistic and the associated significance tests.” The woman to his right raised her hand. “What is the right-size tree?” I smiled. “Using the training data, you can always find the right tree. Then you have to decide whether to pre-prune or to prune. A suggestion is to split all the way and then prune back. A software tool like SAS EM allows having different levels in different branches. It is kind of like a real tree.” I went on, “In the advanced node of SAS EM, you can choose the type of pruning algorithm to use. For instance, CHAID uses multiway splits and pre-pruning for growing classification trees. CART is different because it is restricted to binary splits. The argument is you can replicate multilevel trees with sophisticated binary splits. “Classification trees are easy to understand and can handle missing values. The drawbacks of classification trees are their inherent roughness and the fact that they are linear in nature, capturing only main effects.” The woman from the previous day asked me, “What about using neural networks, Al?” “Neural networks are very complex mathematical constructs that provide flexible model fitting. They can emulate standard statistical techniques like nonlinear regression, clustering, and classification. They have a terminology of their own.”4


“Yes. An optimal set of weights is what gets estimated.” The man in the back shouted out, “How does a neural net work?” “There is a combination of inputs computed and fed to the hidden layers. Transformations are then applied to net value input combinations.” Another analyst whipped up another question. “So, if inputs are summed up, is the action triggered if the summed input exceeds some set value?” “Not entirely. It may seem somewhat complicated. To find your target, use preliminary runs (default = 100, 50 to 60 runs), and then iterate optimizing your algorithms. This is hard to explain to most executives. However, neural networks analysis is very useful to confirm findings from simple techniques like regression analysis and decision trees.”
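Pulling the churn discussion together, here is a hedged, end-to-end sketch written in Base SAS and SAS/STAT rather than the SAS EM point-and-click flow. The data set, variable names, and seed are assumptions; PROC HPSPLIT stands in for the SAS EM tree node where it is licensed.

   /* 1. Partition the customer-level table into training and validation sets */
   data train valid;
      set customer_rollup;
      if ranuni(20140601) < 0.7 then output train;
      else output valid;
   run;

   /* 2. Stepwise logistic regression, including a candidate interaction term */
   proc logistic data=train;
      class plan_type / param=ref;
      model churn(event='1') = tenure age total_paid plan_type tenure*total_paid
            / selection=stepwise slentry=0.05 slstay=0.05;
      score data=valid out=valid_scored;      /* adds predicted probability P_1 */
   run;

   /* 3. Apply the 0.5 and 0.1 cutoffs discussed earlier to rank customers */
   data action_list;
      set valid_scored;
      length segment $10;
      if      P_1 >= 0.5 then segment = 'High risk';
      else if P_1 >= 0.1 then segment = 'Watch list';
      else                    segment = 'Low risk';
   run;

   /* 4. A companion decision tree: grow on the Gini criterion, then prune */
   proc hpsplit data=train maxdepth=6;
      class churn plan_type;
      model churn = tenure age total_paid plan_type;
      grow gini;
      prune costcomplexity;
   run;

Comparing the tree’s rules with the logistic coefficients is one practical way to confirm findings across techniques, as noted above.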

RISK SCORING MODEL

“Let’s discuss the risk adjustment model I mentioned earlier so that you get the idea of how the development of these two models differs. The size of the expenditures and the transaction volume of the claims submitted characterize subscriber risk. “To build a scorecard, the analyst will use a standard methodology to define the sample window of good/bad subscribers and the objective of the scorecard and its characteristics.” I continued, “Some of the technical requirements the analyst should consider are: available data, existing operating systems, software, hardware, human resources, and time. Data must be available within the sample window definition. Data must be consistent, that is, no missing values, data conversion, data integrity (coding structure), and data volume. Data must also be true and unbiased, that is, distortion free, minimal typographical errors. “As you know, there are software tools for different purposes, such as data extraction/processing and data quality software called information management software, data reporting, exploration and visualization software, as well as software for statistical processing, forecasting, and optimization. “As analysts, we work with business experts in benefit cost, research and development, healthcare economics, statistical experts in


linear and nonlinear regression, probability theories, statistical distribution, and inference.5 “Analysts work with everyone from the IT personnel to the project manager and/or at least a coordinator. Analysts work with time— actual development time, not total lapsed time.” Some general examples of expected times for different tasks are: ■■

Data definitions and collection: one to six months

■■

Business definitions: one month

■■

Statistical analyses: one to two months

■■

SAS EM reports and results presentation: one month

■■

Model validation review: one month

I asked everyone to move their chairs into a circle rather than facing me, to generate a collaborative discussion environment. “I believe that if the organization has internal policies and good governance, everything will go faster, and with the appropriate tools, time spent on trivialities will decrease.” An analyst asked, “What are some examples of considerations concerning the implementation of a predictive analytics process?” “As an analyst, you need to have a system to incorporate the results of your model. You will integrate the results of your model into your operational systems, scoring your data as they are processed (real time) or scoring data in batch modes. Your system would need to have the flexibility to accommodate the scoring of new data now and in the future. You will have to create a database to store consistently updated scorecard information. Just in case somebody asks!” The analyst leaned her briefcase against her neighbor’s chair and got out her pen. “Now, how do we create a healthcare scorecard?” “That’s a great question. Let’s move to another model. “A healthcare scorecard is a static tool to benchmark health plan performance. One not only needs to check its validity but one must also ascertain the stability of the scorecard. As an analyst, you should provide adjustments to the scorecard if it becomes misaligned because it is of extreme importance to maintain the validity of the scorecard.” The woman is writing down everything I say. I continue, “It is important to rebuild it at least every three years. If you see a signal of


scorecard deterioration, then you have a strong indication of the need to rebuild.” I walked over to her chair. “The population stability report, or PSR, is an industry standard. You should check the stability and population shift of the scorecard. It is important to define a good versus a bad analysis report.” I paced around the circle. “The analysts should check the stability of the discriminating power of the scorecard. They also should create a final average score report. A final check on the score distribution should be derived from the scorecard versus the good/bad definition.” Good versus bad versus scorecards versus trees. “Integrating with current systems is a critical operation. There is a need for extraction and loading of operational data store. This ETL of ODS is done every time that a new scorecard is built. Each scorecard serves a particular purpose. I encourage you to create a data repository for healthcare risk analysis and management with the construction of an end-to-end solution for a complete risk management system.” I scooted another chair into the circle and sat down with everyone. “What is scoring?” An analyst raised his hand. “Scoring is a statistical method to predict the likelihood of a particular outcome of an event occurring based on a set of associated and relevant predictors, factors, regressors, and variables. This information must be historic and collective. Events can be binary (two outcomes) or multinomial (more than two outcomes). When applied to healthcare cost, it is called healthcare risk scoring.” I winked. “Yes. In healthcare, scoring is applied in a form of mathematical tool called a risk scorecard. This is completely different from other scorecards, such as a balanced scorecard, credit scorecard, response scorecard, parole scorecard (for prisoners in England), purchase scorecard, cross-sell scorecard, or churn scorecard. Basically, it is a set of scores or weights assigned to every attribute of each selected characteristic from administrative data typically.” For instance: ■■

Age, 18–25 = 0, 26–35 = 10, 36–45 = 20, 46+ = 30

■■

Marital Status, Single = 0, Married = 30

■■

Total Payments,