Table of contents:
1. Data Catalogs
What Is in a Data Catalog?
Data Catalog Features and Example Applications
A Framework to Characterize Data Catalogs
Summary
2. Types of Data Catalogs
Tool-Adjunct Data Catalogs
Broad Connectivity
Intelligence
Active Governance
Domain-Specific Catalogs
Broad Connectivity
Intelligence
Active Governance
Data Catalog Platforms
Broad Connectivity
Intelligence
Active Governance
Summary
3. Implementing a Data Catalog
Data Catalog in an Enterprise Data Stack
Enterprise Data Lakes
The Modern Data Stack
Data Mesh
Data Fabric
Successful Implementation of Data Catalogs
Accommodate Existing Workflows for Data Users
Focus on People
Focus on Business and Technical Metadata
Have an Adoption Plan
Measure Adoption and Impact of the Data Catalog
Summary
4. Enterprise Data Catalog Business Impact
Catalog Business Impact
Catalog Use Cases
Self-Service Business Intelligence
Data Governance and Guided Data Usage
Data Operations
Cloud and Multicloud Migration
Summary
5. Conclusion
About the Authors


Implementing a Modern Data Catalog to Power Data Intelligence: Make Trustworthy Data Central to Your Organization
Fadi Maali and Jason Lim

Implementing a Modern Data Catalog to Power Data Intelligence
by Fadi Maali and Jason Lim

Copyright © 2022 O’Reilly Media, Inc. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Acquisition Editor: Jessica Haberman
Development Editor: Shira Evans
Production Editor: Katherine Tozer
Copyeditor: Justin Billing
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

June 2022: First Edition
Revision History for the First Edition: 2022-06-06: First Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Implementing a Modern Data Catalog to Power Data Intelligence, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the authors and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights. This work is part of a collaboration between O’Reilly and Alation. See our statement of editorial independence. 978-1-492-09874-4 [LSI]

Chapter 1. Data Catalogs

A data catalog is a collection of metadata describing data assets and their usage. Modern data catalogs provide relevant functionality to support metadata management, enrichment, and search. They not only help users find relevant data but guide them on proper use of that data. Data catalogs help answer the questions:

How can I find relevant data?
Once I find data, can I use it? Should I use it? How should I use it?

Cataloging and managing metadata in enterprises is not a new practice. Metadata repositories have existed since the 1970s, and relational databases have had metadata catalogs since their early days. However, in the years since, the technology surrounding data and the role of data in the enterprise have both changed substantially. Enterprise data landscapes have grown more sophisticated—the “3 Vs” of big data (volume, velocity, and variety) are widely known. And the legislative environment mandating compliant data usage continues to grow in complexity as more people (and AI-powered programs) access and use data in new ways.1 Moreover, the growing adoption of cloud computing and SaaS results in more data residing outside the enterprise infrastructure and control. As a result, collecting, managing, and using comprehensive and accurate metadata has become paramount; and modern data catalogs are the tools that enable best practices.

Modern data catalogs have grown in maturity and sophistication to address new and increasingly complex challenges. They now provide a comprehensive set of functionalities to integrate with other enterprise data

tools and to support automatic collection and enrichment of metadata, using advanced techniques such as machine learning, natural language processing, and crowdsourcing. Companies and developers alike recognize the increasing importance of modern data catalogs. In fact, a proliferation of tools and projects to build enterprise data catalogs reflects this growing interest. There are currently a number of companies specializing in enterprise data catalogs, such as Alation, Informatica, and Collibra. Many companies have built their own data catalog software and some have made them available for free.2 Additionally, all major cloud providers (AWS, GCP, and Microsoft Azure) have data catalog offerings.3 In this chapter, we describe the content of a data catalog, present a sample of features and example applications, and conclude with a summarizing framework of data catalog features.

What Is in a Data Catalog?

Data catalogs contain metadata describing data assets and other related assets in an enterprise. To make this more concrete, it is helpful to take a closer look at the various types of dataset metadata and some related examples. Google Data Catalog distinguishes between technical metadata (e.g., schema information) and business metadata (structured tags), whereas the Ground project provides a more comprehensive framework to understand metadata. The Ground project introduces the ABC model of metadata, which categorizes metadata into application (information that describes how the data can be interpreted for use), behavioral (information about how data was created and used over time), and change (information about the version history of data). It is important to note that metadata can describe various aspects of data assets and their relationships. Table 1-1 lists common metadata categories.

Table 1-1. Common metadata categories

Core metadata: Title, description, creation date, and owner
Access metadata: Information about systems that host the data and how the data can be accessed
Schema: Information about the various fields in the data along with their descriptions, type information, and other related information
Classification and tagging: Tags that link a dataset to a business glossary or to some defined classification within an enterprise
Versioning: Links to previous and newer versions
Relationships: Relationships with other data assets and relationships with other entities within an enterprise, such as people and dashboards
Content description: Statistics about the content of the various fields in the data
Lineage: Links from the data to its upstream and downstream datasets and other derived data products
Usage: Information about how often the data is used and by whom
Data quality: Information about the completeness, accuracy, and validity of the data

This list is not exhaustive, nor does it mean that a data catalog must include all these types of metadata. However, the list is provided to highlight the key role a data catalog can play by integrating information from various systems into one central, accessible place. In the next section, we describe how the richness of a data catalog’s content enables an equally rich set of applications and opportunities.
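Before moving on, here is a minimal sketch of how a single catalog entry might combine several of these categories; the field names are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """Illustrative catalog record combining several metadata categories."""
    # Core metadata
    title: str
    description: str
    owner: str
    created: str                       # ISO 8601 date
    # Access metadata
    source_system: str                 # e.g., a warehouse or object store
    # Classification and tagging
    tags: list = field(default_factory=list)
    # Lineage
    upstream: list = field(default_factory=list)
    downstream: list = field(default_factory=list)
    # Usage
    query_count_30d: int = 0
    # Data quality
    completeness: float = 1.0          # fraction of non-null values

entry = CatalogEntry(
    title="orders",
    description="One row per customer order",
    owner="data-eng",
    created="2022-01-15",
    source_system="warehouse.sales",
    tags=["pii:none", "certified"],
    upstream=["raw.orders_events"],
)
print(entry.title, entry.tags)
```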

Data Catalog Features and Example Applications

A data catalog supports search and discovery of data assets for both data consumers and producers. Robust search requires that a data catalog have the ability to collect metadata about datasets, keep them updated, and make them searchable. A data catalog, therefore, should support extracting metadata

from common data sources, such as databases, file systems, APIs, and business intelligence (BI) tools. It should also adapt to new, popular data types, such as unstructured and streaming data. As an enterprise tool, data catalogs should ensure secure access to their contents.

Data catalogs should also be scalable. It is a misconception to think that data is big, and metadata is small. On the contrary! When metadata is tracking various aspects related to different versions of a large number of assets, that metadata repository itself will grow large. Thus, data catalogs need to be architected to be scalable and performant; they must be designed to handle large amounts of data.

Beyond these basic functionalities, other features support innovation and collaboration. The following list illustrates how a data catalog enables innovation around data in an enterprise:

Recommendation and guided navigation
Searching datasets is not as simple or straightforward as a text search. A data catalog can use various explicit and implicit quality signals when ranking datasets for recommendation. Like a Google search, a data catalog can guide users to the most trusted data that comes from a reliable source and is frequently used. Furthermore, a data catalog can recommend domain experts who are automatically identified based on actual data usage.

Intelligent extraction of implicit and missing information
Metadata, such as description and tags, is typically provided by data creators and stewards. Other metadata, such as schema and creation date, is provided by tools that manage data. Other metadata is implicit—it can be inferred by looking at the data itself and the context of its usage. The following sidebar describes an innovative use of implicit metadata.

EXAMPLE PROJECT 1: DATASET RECOMMENDATION

In a project one of the authors worked on, the data catalog was also

used to recommend related datasets. When a user views a particular dataset, the data catalog shows a list of other datasets that are commonly queried together with the dataset being viewed. Similarly, frequent users of the dataset are also shown. This information was extracted by analyzing usage logs.

Techniques like data profiling can infer valuable metadata about data quality. Moreover, behavioral information extracted from data usage provides social signals about data quality. Which datasets are the most popular? How are they used? Who uses them? Odds are, if a dataset is widely used, it’s safe to trust. Approaches based on machine learning and natural language processing (NLP) can also be utilized here (see the following sidebar).

HIGHLIGHTED CASE: ALATION’S BUSINESS-FRIENDLY TITLES

Alation uses machine learning to enrich automatically extracted metadata. Using machine learning, Alation can recommend business-friendly titles for technical terms and abbreviations. Those recommendations are provided to human users who can approve or reject them. User responses to the recommendations are then fed back to the model in order to continuously tune its performance.
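A minimal sketch of the co-query recommendation described in Example Project 1, assuming usage logs have already been reduced to the set of datasets each query touched (the log contents are invented for illustration):

```python
from collections import Counter
from itertools import combinations

# Each entry is the set of datasets one logged query touched (illustrative data).
query_logs = [
    {"orders", "customers"},
    {"orders", "customers", "payments"},
    {"orders", "payments"},
    {"inventory"},
]

# Count how often each pair of datasets appears in the same query.
co_counts = Counter()
for datasets in query_logs:
    for a, b in combinations(sorted(datasets), 2):
        co_counts[(a, b)] += 1

def related(dataset, top_n=3):
    """Datasets most often queried together with `dataset`."""
    scores = Counter()
    for (a, b), n in co_counts.items():
        if a == dataset:
            scores[b] += n
        elif b == dataset:
            scores[a] += n
    return scores.most_common(top_n)

print(related("orders"))  # e.g. [('customers', 2), ('payments', 2)]
```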

Collaboration and crowdsourcing
A data catalog can capture metadata from users actively, by soliciting their feedback. Wiki-like articles around data assets play host to common knowledge within an enterprise—and serve as living documents, open for experts to update. Ratings and resources are also useful; a data catalog can, for instance, allow users to rate a dataset or link to related resources and help articles. Furthermore, a data catalog can enable discussion or questions/answers

about a dataset to take place within the catalog itself. This keeps all information in one place, fosters a sense of community among data users, and supports a self-service learning environment.

Managing sensitive data
Sensitive data such as personally identifiable information (PII) needs to be managed carefully in order to comply with regulations like GDPR and CCPA. A data catalog needs to support discovering, classifying, and tagging sensitive data assets. It needs to go beyond identifying sensitive data and guide users on compliant data usage as well. At a minimum, a data catalog can surface compliance information to the user at the point of data consumption.

Interoperability and extensibility
A data catalog can provide further functionality (e.g., for visualization and lineage analysis) by integrating with other specialized tools. Furthermore, a data catalog can expose its internal services via open and expressive APIs to allow building custom functionality (see the following sidebar).

EXAMPLE PROJECT 2: FIELD-LEVEL LINEAGE

A project one of the authors worked on required maintaining lineage relationships and quality information at the field level rather than only at the dataset level. While the underlying data catalog supported defining fields of datasets, those fields were not first-class citizens. It was not possible to associate custom attributes to fields or define relationships among them. However, the data catalog supported defining custom entities and relationships between them. We were able to define fields as custom entities to achieve our goal.
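The workaround described in the sidebar can be sketched generically: model fields as first-class entities that carry their own attributes and participate in typed relationships. The toy entity store below is invented for illustration and is not the actual catalog’s API:

```python
# A toy entity/relationship store: fields become first-class entities
# so they can carry attributes and appear in lineage relationships.
entities = {}       # entity_id -> {"type": ..., "attrs": {...}}
relations = []      # (source_id, relation_type, target_id)

def add_entity(entity_id, entity_type, **attrs):
    entities[entity_id] = {"type": entity_type, "attrs": attrs}

def add_relation(source, relation_type, target):
    relations.append((source, relation_type, target))

add_entity("orders", "dataset")
add_entity("orders.total", "field", data_type="decimal", quality_score=0.98)
add_entity("revenue_report.total", "field", data_type="decimal")
add_relation("orders", "has_field", "orders.total")
add_relation("orders.total", "derives", "revenue_report.total")

# Field-level lineage query: what does orders.total feed into?
downstream = [t for s, r, t in relations if s == "orders.total" and r == "derives"]
print(downstream)  # ['revenue_report.total']
```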

A Framework to Characterize Data Catalogs

The list of features and functionalities a data catalog provides can be daunting and hard to comprehend. This makes it difficult to understand the main traits of a given data catalog or to compare various data catalogs. In this section, we provide a framework to understand and judge a data catalog along three key aspects: broad connectivity, intelligence, and active governance.

Broad connectivity
Data catalogs with broad connectivity have flexible and extensible data models. They capture metadata and represent not only data assets in an enterprise, but related entities, such as metrics, charts, AI features, and users. Catalogs with broad connectivity are designed to easily integrate with other systems in an enterprise. They expose their internal services via open and expressive APIs to allow for further extensibility.

Intelligence
Intelligence allows catalogs to go beyond capturing only explicit metadata. Intelligence enables catalogs to incorporate human knowledge, both passively (by tracking human usage and popularity of assets) and actively (by crowdsourcing tribal knowledge and incorporating users’ feedback). These catalogs employ advanced techniques, such as machine learning and NLP, to enrich collected metadata, extract links and relationships, and infer implicit and missing information.

Active data governance
Active governance guides users as they find and use data. A data catalog with active governance will surface compliance information about sensitive data at the point of use, so as to encourage users to use canonical and high-quality data assets; it will also provide a way to ask domain experts for help. Such catalogs actively help users to ensure compliant usage of data with features such as masking, which anonymizes PII for given user personas who are restricted from viewing it per the GDPR.
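As a small illustration of the masking feature mentioned above, the sketch below anonymizes PII columns for restricted personas; the column list and persona permissions are invented for illustration:

```python
# Illustrative persona-based masking: PII columns are anonymized for
# personas that are not permitted to view them.
PII_COLUMNS = {"email", "ssn"}
CAN_VIEW_PII = {"data_steward"}          # assumed persona permissions

def mask_row(row, persona):
    """Return a copy of `row` with PII values masked for restricted personas."""
    if persona in CAN_VIEW_PII:
        return dict(row)
    return {k: ("***" if k in PII_COLUMNS else v) for k, v in row.items()}

row = {"name": "Ada", "email": "ada@example.com", "amount": 42}
print(mask_row(row, "analyst"))       # {'name': 'Ada', 'email': '***', 'amount': 42}
print(mask_row(row, "data_steward"))  # full, unmasked row
```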

Summary

We introduced enterprise data catalogs in this chapter. We described the role of a data catalog, its structure, and the functionality and features it provides. We provided a framework to understand the characteristics of a data catalog along three aspects: broad connectivity, intelligence, and active governance. In the next chapter, we discuss types of data catalogs.

1 Examples include GDPR, CCPA, and HIPAA.
2 Examples include LinkedIn’s DataHub, WeWork’s Marquez, Lyft’s Amundsen, Airbnb’s Data Portal, Uber’s Databook, and Netflix’s Metacat.
3 See AWS Glue Data Catalog, Google Data Catalog, and Azure Data Catalog.

Chapter 2. Types of Data Catalogs

In this chapter, we look at the different types of data catalogs. The goal of this chapter is not to categorize data catalogs into separate groups, but rather to provide a simple framework for how a data catalog’s focus influences the three main characteristics we talked about in the first chapter: broad connectivity, intelligence, and active governance. The three main types we discuss here are tool-adjunct data catalogs, domain-specific data catalogs, and data catalog platforms.

Tool-Adjunct Data Catalogs

Tool-adjunct data catalogs are built as part of an existing tool. Typically, these catalogs are not part of a tool’s main offering, but an add-on to enhance a user’s experience or to extend the tool’s functionalities. An early example of such catalogs is the internal data catalog of relational database management systems. This catalog stores metadata essential for operating the database. Examples of such metadata include a list of tables, the columns in each table, and a list of views. Although the main focus for this data is internal operations, relational databases typically expose this metadata to users. They also support a limited set of metadata that is meant for human consumption, such as descriptions of tables and fields.

Tool-adjunct data catalogs have evolved to focus more on human users in addition to internal operations. Hive Metastore, for example, aims to facilitate discovery of data in the Hadoop ecosystem. It supports custom metadata such as tagging of data assets. It also extends its scope by providing metadata about data across multiple tools (all within the Hadoop family, though). Another example of tool-adjunct catalogs is Tableau Catalog.1 Tableau

Catalog automatically discovers and indexes all data used in Tableau. It uses this collected metadata to support discovering useful data and visualizations. It also surfaces lineage information between dashboards and their data sources. Furthermore, Tableau Catalog allows users with permissions to curate metadata by adding tags and descriptions and by certifying datasets. Data catalogs offered by many cloud providers, such as Google Data Catalog, are not specific to a single tool. Nevertheless, their focus is typically the set of data tools offered by their corresponding cloud provider. For example, Google Data Catalog integrates with Google Cloud Platform (GCP) tools such as BigQuery and Google Cloud Storage out of the box. In general, tool-adjunct data catalogs provide deep and focused integration with their main tool (or family of tools). The tight integration of such catalogs with a specific tool enables them to provide a deep, rich metadata model and to automatically collect and update this metadata. However, these catalogs are, by definition, restricted to a specific tool. The next few sections discuss the characteristics of tool-adjunct data catalogs in terms of broad connectivity, intelligence, and active governance.
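To see a tool-adjunct catalog in miniature, the sketch below (not from the report) inspects SQLite’s internal catalog using Python’s standard library; SQLite stands in here for the internal metadata catalog any relational database exposes:

```python
import sqlite3

# SQLite keeps its own tool-adjunct catalog in the sqlite_master table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL, created TEXT)")

# List cataloged tables and their schema DDL...
for name, sql in conn.execute("SELECT name, sql FROM sqlite_master WHERE type = 'table'"):
    print(name)   # orders
    print(sql)    # the CREATE TABLE statement

# ...and per-column schema metadata.
for cid, name, col_type, *_ in conn.execute("PRAGMA table_info(orders)"):
    print(cid, name, col_type)   # 0 id INTEGER / 1 total REAL / 2 created TEXT
```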

Broad Connectivity

Tool-adjunct data catalogs tend not to be agile. They are typically shipped with a specific data model that is hard to extend. They are tailored to collect metadata from a defined set of sources. They focus on specific use cases and are, therefore, not typically designed to be easily integrated with other tools or workflows involving external systems.

Intelligence

The strong integration of tool-adjunct data catalogs with their specific tool enables them to intelligently collect explicit (and infer implicit) metadata. They typically apply advanced techniques suitable for their main focus. For example, many tool-adjunct data catalogs will monitor usage logs and usage patterns and use this information as signals of the popularity and importance of data assets. Nevertheless, this intelligence is limited because:

Metadata is coming from a specific tool and does not include signals from other sources.

Catalogs typically have specific use cases (and limit the techniques they apply accordingly).

Active Governance

As with intelligence, tool-adjunct data catalogs are well-positioned to actively guide usage within their tools. However, they are limited outside their corresponding tools as they are separated from other systems and tools where data is either created or accessed.

Domain-Specific Catalogs

This type of data catalog is not specific to a single tool but focuses on enabling a particular use case. For example, some data catalogs focus on being a search engine for data, while others focus on lineage relationships between data assets or on data governance. Let’s examine some of these focuses:

Search focus
Data catalogs that focus on search bring techniques and methods from information retrieval and web search engines to the data domain within enterprises. Some of those catalogs, such as Facebook Nemo, use advanced machine learning and NLP tools to provide personalized search of data within an enterprise. The search can also use data-specific signals such as usage, popularity, and freshness to rank data assets by usefulness.

Lineage capturing focus
Data catalogs that focus on capturing lineage between data assets, such as Marquez, will automatically integrate with popular data processing tools, such as dbt, Apache Airflow, or Apache Spark. They maintain a complete history of data processing runs and related statistics. Those catalogs excel

at supporting data operations teams.

Governance focus
Data catalogs that focus on governance are concerned mainly with controlling data access and ensuring that data is used according to defined policies; this includes external policies such as data privacy laws as well as policies defined within an enterprise. Those catalogs apply techniques to identify data assets with sensitive information and to monitor data flow and access.

As shown in the preceding examples, domain-specific catalogs use advanced techniques to address their corresponding domain. But expanding beyond that domain is nearly impossible; extending their usage or their data model, or integrating them with external systems, are challenging tasks. The next few sections discuss the characteristics of domain-specific data catalogs in terms of broad connectivity, intelligence, and active governance.
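To make the search focus concrete, here is a minimal sketch of blending data-specific signals into a ranking score; the weights and signal values are invented for illustration:

```python
# Illustrative ranking that blends text relevance with data-specific signals.
WEIGHTS = {"relevance": 0.5, "popularity": 0.3, "freshness": 0.2}  # assumed weights

datasets = [
    {"name": "orders_daily", "relevance": 0.9, "popularity": 0.8, "freshness": 0.9},
    {"name": "orders_legacy", "relevance": 0.9, "popularity": 0.1, "freshness": 0.2},
]

def score(ds):
    """Weighted sum of normalized (0-1) signals."""
    return sum(WEIGHTS[signal] * ds[signal] for signal in WEIGHTS)

for ds in sorted(datasets, key=score, reverse=True):
    print(ds["name"], round(score(ds), 2))
# orders_daily 0.87
# orders_legacy 0.52
```

Even with identical text relevance, the actively used, fresher dataset ranks first, which is exactly the behavior described above.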

Broad Connectivity

Domain-specific catalogs tend not to be agile. They are typically shipped with a specific data model that is hard to extend, and they focus on a specific domain. They generally fit better in a defensive data strategy that focuses on control (e.g., minimizing the downside risk and ensuring compliance) more than on flexibility or creating a competitive advantage with data.

Intelligence

Domain-specific catalogs utilize advanced techniques to collect explicit and implicit metadata from a number of related tools. However, this is typically limited to their main domain.

Active Governance

Domain-specific catalogs can play an active role within their corresponding

domain. For example, catalogs focusing on lineage actively recommend actions to support data operations teams with tasks such as backfilling historical data or deprecating data assets.

Data Catalog Platforms

Data catalog platforms take a more holistic view to focus not only on data assets within an enterprise, but also on the surrounding ecosystem (including business and people elements). They are typically characterized by an extensible data model that can grow to define various assets and concepts, such as metrics, charts, AI features, and users. Data catalog platforms typically augment their data with a focus on business and users to support collaborative governance and enrichment of metadata and to interlink data with business glossaries and dictionaries. Moreover, they are architected to make them easily integrable with other systems.

Integration with other systems is a core design element of data catalog platforms. They are built as a platform for other tools to be built upon in order to enable various use cases. Data catalog platforms can be a catalog of catalogs, which not only utilizes the depth of specialized catalogs, but also extends them by pulling together metadata from across the enterprise.

The holistic approach of data catalog platforms is essential to enable data intelligence, which is intelligence about the data, as informed by metadata (and not intelligence informed by the data itself). According to the International Data Corporation (IDC), data intelligence helps organizations answer six fundamental questions:

Who is using what data?
Where is the data, and where did it come from (lineage and provenance)?
When is data being accessed, and when was it last updated?
Why do we have data? Why do we need to keep (or discard) data?

How is data being used, or perhaps more specifically, how should data be used?
What relationships are inherent within data and with data consumers?

Answering these questions requires collecting metadata from multiple systems across the enterprise and integrating the business context surrounding the data, a task that data catalog platforms address. Alation is an example of a catalog as a platform for data intelligence. It has an extensible data model and a modern architecture to facilitate integration with other systems, employing a number of advanced techniques, such as machine learning and NLP, to automatically surface context and remove manual efforts. Apache Atlas is another example of a data catalog platform. Apache Atlas is an open source data catalog with a flexible data model, a data access API, and a query language.
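For instance, Apache Atlas exposes its catalog through a REST API. The sketch below is a minimal illustration of querying it from Python; it assumes a local Atlas instance with the default host, port, and credentials, and the exact endpoint parameters and response fields may vary by version:

```python
import requests

# Assumes a local Apache Atlas instance with default host and credentials.
ATLAS = "http://localhost:21000/api/atlas/v2"

# Basic search for cataloged entities of a given type matching a keyword.
resp = requests.get(
    f"{ATLAS}/search/basic",
    params={"typeName": "hive_table", "query": "orders", "limit": 5},
    auth=("admin", "admin"),
)
resp.raise_for_status()
for entity in resp.json().get("entities", []):
    print(entity.get("displayText"), entity.get("typeName"))
```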

Broad Connectivity

The main focus areas of data catalog platforms are agility and broad connectivity. They are characterized by an extensible data model and an architecture that focuses on easy integration with other systems. They also typically expose their internal services via open and expressive APIs, which allow for further extensibility across the modern data stack.

Intelligence

Data catalog platforms apply advanced techniques to infer metadata, such as popularity of a given asset or common joins. Their interoperability allows them to harness intelligence built in other systems with which they integrate.

Active Governance

Positioned as a platform within the enterprise data ecosystem, data catalog platforms continuously integrate with other systems where data is either

created or accessed. They can play an essential role in guiding users to proper use of data and away from improper use. Examples of governance features that guide correct and compliant usage in a catalog include:

Intelligent SQL editors
These can store past queries, surface popular queries for given datasets, and flag data pursuant to compliance laws as you’re writing.

Quality flags
These signal to users, at a glance, whether data is trusted, questionable, or deprecated, so people know what to trust and what to avoid.

Automated business glossary
This boosts self-service capabilities with definitions, policies, rules, and KPIs, and by engaging and incentivizing relevant stakeholders—such as line-of-business stewards and subject matter experts—to define, document, and certify business assets.

Stewardship dashboards
By measuring, monitoring, automating, and tracking stewardship activities, data leaders can streamline governance activities.

Table 2-1 summarizes the three types of catalogs discussed in this chapter.

Table 2-1. Summary of data catalog types

Purpose
  Tool-adjunct catalogs: Govern and manage assets within a tool.
  Domain-specific catalogs: Govern and manage assets in a domain. Built to address specific needs within a domain.
  Data catalog platforms: Govern and manage assets across the enterprise. Provide a platform to build upon.

Broad connectivity
  Tool-adjunct catalogs: Limited—internal to the tool.
  Domain-specific catalogs: Limited to the domain.
  Data catalog platforms: Extensible data model and easy to integrate.

Intelligence
  Tool-adjunct catalogs: Deep but narrow.
  Domain-specific catalogs: Deep but narrow.
  Data catalog platforms: Deep and wide.

Active governance
  Tool-adjunct catalogs: Tool specific.
  Domain-specific catalogs: Domain specific.
  Data catalog platforms: Across the enterprise.

Summary

Data search and discovery is at the core of every data catalog. However, the scope and approach of catalogs vary. This is also typically affected by the catalog’s origin and evolution. Enterprises do not need to exclusively choose one type or another. In fact, it is common for data catalog platforms to be a catalog of catalogs, collecting metadata from different tools and other catalogs to provide a single, central platform within an enterprise. Nevertheless, choosing a specialized catalog for a given use case without considering the enterprise’s future needs is a common pitfall worth avoiding. A highly specialized data catalog that is hard to integrate or extend can be a short-term solution that limits the enterprise’s capabilities in the long term.

1 Tableau Catalog was introduced in 2019.

Chapter 3. Implementing a Data Catalog

The myriad innovations and novel concepts in data analytics over the last few years have birthed several conceptual frameworks and architectural approaches. From big data to data lakes, data mesh to data fabric, the field is evolving rapidly. These are concepts whose definitions are still being debated by pundits, researchers, and vendors, yet they are already shaping data analytics practices in many enterprises. In this chapter, we briefly discuss a few of these recent architectures and concepts as well as how a data catalog can support each of them. We then close the chapter with a number of recommendations for successful implementation of an enterprise data catalog.

Data Catalog in an Enterprise Data Stack

In the next few sections, we discuss four popular recent concepts in data analytics architecture—data lakes, the modern data stack, data mesh, and data fabric.

Enterprise Data Lakes

A data lake is a centralized repository that allows enterprises to store all structured and unstructured data at any scale. Data lakes are characterized by open-ended schema-on-use data and agile development. Many enterprises have adopted data lakes as an alternative or complement to traditional data warehouses. Given growing numbers of data sources and the increasing importance of data analyses, a data lake allows analysts to easily ingest and transform data at a rapid pace. Furthermore, a data lake contains many data sources within an organization regardless of data quality

characteristics. While data lakes aim to improve analyst productivity, the challenges with the volume, variety, and veracity of the data can have counterproductive effects if not managed carefully. When data analysts and data scientists use data to create dashboards, reports, and models, they need trustworthy and reliable data. Bad data generates bad insights, leading to a “garbage in, garbage out” problem. Without a cataloging mechanism for content and context of data in a data lake, analysts are often unable to discover what data exists, and how it has been generated and previously used by peers. After an initial surge in data-lake activity, fear of bad data will cause analysts to stop using the data lake. A commonly used analogy here is that a data lake can easily devolve into a data swamp. Given the typical volume of an enterprise data lake, manually created data catalogs will definitely not scale. Metadata collection needs to be automated. However, relying on collecting only explicit technical metadata is not sufficient. Providing proper guidance for data usage requires catalogs to also collect business, operations, and social metadata. In a data lake setting, some data assets will lack any descriptive metadata. Therefore, gleaning the missing metadata via inference and machine learning becomes essential. Crowdsourcing human knowledge is another benefit that catalogs provide. While machine learning helps scale and automate metadata curation, the people who know most about the data can contribute further knowledge and context and provide validity checks of any automatically added metadata. In summary, a capable data catalog supports the data lake in improving analysts’ productivity. A data catalog makes data findable, surfaces relationships, and provides context around each data asset to allow users to assess quality and trustworthiness of the data.
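As a small illustration of the profiling mentioned above, the sketch below computes two of the signals from Table 1-1 (completeness and distinct counts) over invented in-memory rows:

```python
# Toy profiler: infer simple quality/content metadata from the data itself.
rows = [
    {"id": 1, "email": "a@example.com", "country": "IE"},
    {"id": 2, "email": None, "country": "IE"},
    {"id": 3, "email": "c@example.com", "country": "US"},
]

def profile(rows):
    stats = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        non_null = [v for v in values if v is not None]
        stats[col] = {
            "completeness": len(non_null) / len(values),   # data quality signal
            "distinct": len(set(non_null)),                # content description
        }
    return stats

print(profile(rows))
# e.g. email has completeness 2/3 and 2 distinct values
```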

The Modern Data Stack

The modern data stack is a suite of tools used for data integration. Those tools are typically hosted in the cloud and require little technical configuration.

Historically, for their analytics needs, enterprises relied upon a set of tightly coupled tools, typically provided by a single vendor. Nowadays, nearly all of the components of a traditional data warehouse are independent and interchangeable. Those independent tools can be flexibly combined to provide a modern data stack. It is common for current enterprises to have separate tools for data ingestion, data pipelines, data storage and querying, data visualization and business intelligence, and data quality. Furthermore, data can flow in the opposite direction out of the data warehouse in what is referred to as reverse extract, transform, and load (ETL). For example, data aggregated in a central data warehouse can be copied to SaaS tools used for marketing and sales. As a result of adopting the modern data stack, data and its associated metadata are distributed across multiple independent systems. Data can traverse a large number of systems and transform several times before it’s consumed. Understanding lineage of data, as well as the whole of available data, becomes challenging. A data catalog can help address these challenges by integrating and providing visibility into the metadata from various systems. A data catalog also tracks the relationships between data assets across systems. When a number in a dashboard looks suspicious and requires further investigation, a data catalog can provide the full data provenance, showing the flow of the data from original sources to the final dashboard. This lineage data, together with other metadata in a catalog, enables an impact analysis to find out not only what is impacted, but who is impacted and needs to be notified.
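To make the impact analysis concrete, here is a minimal sketch over an invented lineage graph; the asset names and owner mapping are illustrative:

```python
from collections import deque

# Illustrative lineage: edges point from an asset to the assets derived from it.
lineage = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["mart.revenue", "mart.churn"],
    "mart.revenue": ["dashboard.exec_kpis"],
}
owners = {"dashboard.exec_kpis": "bi-team", "mart.churn": "growth-team"}

def downstream(asset):
    """All assets affected by a change to `asset` (breadth-first traversal)."""
    seen, queue = set(), deque([asset])
    while queue:
        for child in lineage.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

affected = downstream("raw.orders")
print(affected)                                      # everything fed by raw.orders
print({owners[a] for a in affected if a in owners})  # who needs to be notified
```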

Data Mesh

Data mesh was introduced by Zhamak Dehghani of Thoughtworks in 2019. It is a data platform architecture that embraces the distributed nature of data and calls for domain product owners to be responsible for delivering data as a product.

Data mesh replaces the common model of having a centralized team (such as a data engineering team) that manages and transforms data. In contrast, a data

mesh architecture calls for responsibilities to be distributed to domain data owners. Those closest to the data should be responsible for it. A core premise of data mesh is federating data ownership among domain data owners who are responsible for their data as a product. Offering the data as a product requires the data to be discoverable and to have explicitly stated quality characteristics and a clearly defined access method. Such requirements are at the core of what data catalogs support. With support for data labeling, curation, and crowdsourced feedback, data catalogs are well positioned to offer data as a product. Furthermore, data catalogs support the enforcement of compliant data usage, which becomes more important when data ownership is not managed centrally.

Data Fabric

Gartner defines data fabric as a “design concept that serves as an integrated layer (fabric) of data and connecting processes. A data fabric uses continuous analytics over existing, discoverable, and inferred metadata assets to support the design, deployment, and utilization of integrated and reusable data across all environments, including hybrid and multicloud platforms.”1 Gartner’s report lists a number of functionalities and characteristics a data catalog must have to enable a data fabric, including the ability to convert passive metadata to active metadata via continuous analysis of metadata to compute key metrics. The report also emphasizes the ability of a data catalog to capture and represent relationships between various data assets and other related assets.

Data fabric is not a single object or a product. It is instead a set of integrated technologies that accelerate value from enterprise metadata. Data catalogs, the main tool to manage enterprise metadata, play a foundational role in enabling data fabric. In fact, according to Gartner, a data catalog is the foundational pillar for a comprehensive data fabric.

Successful Implementation of Data Catalogs

Enterprises typically become interested in data catalogs when they have a specific use case or need in mind. Data governance, self-service analytics, and cloud data migration are common examples. Having a specific need or use case helps focus efforts and measure impact. However, as with other technical efforts within enterprises, it is essential to prepare for long-term sustainable success and to have a plan to maximize successful adoption. We conclude this chapter by providing a number of recommendations that can help achieve a successful implementation of an enterprise data catalog. These recommendations are nontechnical and focus on the human and cultural dimensions of data catalogs instead.

Accommodate Existing Workflows for Data Users

Instead of defining a new workflow as part of implementing a data catalog, look for existing workflows and meet users there. If analysts have a tool where they browse and query the data, work on utilizing the data catalog to enrich the user experience within their favorite tool. Alation, for example, has an integrated SQL query composer to provide useful functionality such as auto-suggestions and trust flags. Auto-suggestions recommend columns to use and explain what they mean, similar to a spell-checker. Trust flags automatically highlight data curated in the data catalog as endorsements, warnings, and deprecations, including whether specific data relates to a data policy.

Focus on People

Build a community around the catalog. Make sure data producers, stewards, and consumers are all involved and empowered to enrich the content of the catalog. Establish a leader or a team to have clear ownership of the data catalog. Leverage the data catalog to identify stewards and other subject matter experts. A data catalog that tracks behavioral metadata, such as top users, can help you find potential champions within your organization.

Focus on Business and Technical Metadata

A common mistake when implementing a data catalog is to focus only on technical metadata. This limits its use and the potential value. It also excludes business users who have valuable related input or need to use the catalog. A catalog should in fact function as a two-way translation layer between technical and business users. The business glossary is one data catalog feature that can align business and technical teams on shared language. By tying key terms, such as revenue, expense, or profit, to hard metrics, the business glossary can promote a shared language that connects such business terminology to your data.

Have an Adoption Plan

Motivating people to use the catalog is essential. Make sure you develop advocates and business champions. Additionally, ensure you have an incentive system to encourage users to participate in crowdsourcing, self-service analysis, and data and knowledge sharing.

Measure Adoption and Impact of the Data Catalog

As with any other product, it is essential to observe the usage of the data catalog and respond accordingly. Ideally, a catalog captures its usage statistics and uses this data to provide out-of-the-box support for prioritizing efforts and maximizing adoption. Moreover, tracking the real business impact of data catalogs is essential. This can be described, quantitatively or qualitatively, in terms of time savings, cost savings, and analysis quality.
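As a minimal illustration of what such measurement might look like, the sketch below computes a few adoption signals from an invented catalog event log:

```python
from collections import Counter
from datetime import date

# Illustrative catalog usage events: (user, action, day).
events = [
    ("ana", "search", date(2022, 6, 1)),
    ("ana", "view_dataset", date(2022, 6, 1)),
    ("bob", "search", date(2022, 6, 2)),
    ("ana", "add_description", date(2022, 6, 3)),
]

active_users = {user for user, _, _ in events}
actions = Counter(action for _, action, _ in events)
contributors = {u for u, a, _ in events if a == "add_description"}

print(f"active users: {len(active_users)}")      # reach of the catalog
print(f"actions: {dict(actions)}")               # engagement mix
print(f"contribution rate: {len(contributors) / len(active_users):.0%}")
```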

Summary

This chapter described how a data catalog can play an essential role to enable, facilitate, and optimize data analytics within enterprises. This is true for enterprises with various approaches and at various levels of maturity in their data stacks. You’ve seen the role data catalogs can play to support data lakes,

a modern data stack, data mesh, and data fabric. We also presented a number of recommendations for successful implementation of an enterprise data catalog. When implementing a data catalog, it is important to remember that a successful catalog requires real buy-in, thought, and accountability. Maximizing successful adoption of the catalog should always be planned early and monitored continuously.

1 Gupta, Ashutosh. “Data Fabric Architecture Is Key to Modernizing Data Management and Integration.” Gartner, May 11, 2021. https://oreil.ly/7k5A8.

Chapter 4. Enterprise Data Catalog Business Impact

In this chapter, we describe the business impact—both qualitative and quantitative—of data catalogs. We then provide a number of concrete use cases that a data catalog supports. In particular, we discuss self-service analytics, data governance and guided data usage, data operations, and cloud and multicloud migration.

Catalog Business Impact

As described before, the core value of an enterprise data catalog is that it is the central place to bring together all information about data in an enterprise. The emphasis on central and all information makes an enterprise data catalog more efficient than a combination of siloed tool-adjunct data catalogs in an enterprise. Furthermore, the enterprise data catalog integrates, interlinks, and enriches the various pieces of metadata. It is the place where the value of the whole becomes greater than the sum of its parts.

Not many resources quantitatively report on data catalogs’ impact, but one notable study, conducted by Forrester Consulting and commissioned by Alation, examined the total economic impact (TEI) and the potential return on investment (ROI) enterprises may realize by deploying Alation. Forrester interviewed seven customers with experience using the Alation Data Catalog. Based on this, in October 2019, Forrester reported the following risk-adjusted present value (PV) quantified benefits of using enterprise data catalogs:

Analyst productivity improved due to shortened data discovery. Improvements amount to savings of $2.7 million.

Business user productivity improved from self-service. Improvements amount to savings of $584,182.

Data engineer productivity improved due to user self-service. Improvements amount to savings of $165,065.

Savings from faster onboarding of new analysts amount to $286,085.

A 2019 Gartner report predicted that by 2021, organizations that offer a curated catalog of internal and external data to diverse users will realize twice the business value from their data and analytics investments compared with those that do not. The report also stated that by 2022, over 60% of traditional IT-led data catalog projects that do not use ML to assist in finding and inventorying data distributed across a hybrid/multicloud ecosystem will fail to be delivered on time.

Data catalogs facilitate data discovery and usage, support data governance and collaboration around data, and help ensure compliant use of data. In the next section, we discuss how these functionalities can be utilized in a number of use cases to drive business value.

Catalog Use Cases

This section discusses the support data catalogs can provide in four use cases: self-service business intelligence, data governance, data operations, and cloud migration.

Self-Service Business Intelligence

In the era of digital disruptions, the businesses that win are more agile and can make decisions at the speed of market change and competition. This means that the pool of decision makers needs to be enlarged. The historical request model—which makes IT a bottleneck—no longer works. Instead, business users, business analysts, and data scientists need the ability to self-discover trustworthy data. Business users need to then make decisions with timely, trustworthy data. This puts the data catalog front and center in self-service BI.

Self-service BI initiatives help organizations become more data-driven and democratize access to data. But data can’t be used if it can’t be found. Search and discovery of trustworthy data is a core value of enterprise data catalogs, and the value extends well beyond business users. It is often said that data scientists and data analysts spend only 20% of their time doing data analysis work, with 80% consumed by data “issues.” The bulk of their time is spent finding, evaluating, understanding, and preparing data before analysis can begin. A data catalog inverts this ratio by enabling data analysts and data scientists to spend 20% of their time looking for data and 80% performing analysis. The value of this—to both the organization as a whole and to individual analysts—cannot be overstated. Not only does the organization benefit from improved efficiency, collaboration, and innovation at scale, but analysts and others also benefit tremendously from improved job satisfaction. Analysts do not enjoy hunting, gathering, and verifying the trustworthiness of data, and have ranked these tasks as the most unpleasant chores of the job.

Another important item highlighted in the Forrester report referenced previously is the role of a data catalog in speeding up the onboarding of new analysts. In a project the authors worked on, catalog data was used to provide personalized recommendations of datasets potentially of interest to a new employee, based on the access patterns of other team members and people with similar roles.

By facilitating self-service BI, data catalogs improve employees’ productivity, reduce time to insight, and positively impact employees’ satisfaction and, therefore, retention.

Data Governance and Guided Data Usage

Data users need to first know where to find relevant data, and data catalogs are essential tools to address this need. However, after finding the data, users need to understand the data, as well as know how (and whether) to use it. There are a number of ways a data catalog can guide the proper usage of data.

Dataset and expert recommendation
In a self-service environment with multiple publishers, it’s impossible to completely avoid data redundancy and overlap. Multiple data assets with similar content, but possibly with varying quality, will exist. A data catalog can guide users to trusted data that comes from a reliable source and is frequently used. A data catalog can also use various explicit and implicit quality signals when ranking datasets for recommendation. Some of those signals are discussed next. Furthermore, a data catalog can recommend domain experts who are automatically identified based on actual data usage.

Certified datasets
Subject matter experts can provide endorsement of high-quality datasets that can be trusted. This can be in the form of a star ranking or a certification flag. Similarly, deprecated or unmaintained data can be flagged or given a low star ranking, not unlike restaurant reviews on Yelp or product reviews on Amazon. These certifications are automatically utilized by the catalog at the point of data use. Potential data consumers can quickly identify trustworthy data and save time. Data certification complements the recommendations by offering a way to promote data through curation.

Data quality
In addition to explicit dataset certifications, a data catalog can integrate data quality signals from dedicated external systems or perform data profiling to surface quality characteristics of various data sources. These quality signals are accessible to the users intending to use a dataset and can also be used when recommending a dataset or ranking it in a search result.

Related policies and context
A data catalog provides context around data for users to guide its usage. This includes flagging sensitive data, such as PII, and related policies and regulations like GDPR and CCPA. Moreover, a data catalog can provide the business context around the data by linking it to a business glossary. This context improves understanding of data and facilitates proper usage across all

user groups.

In summary, by guiding users to high-quality data, ensuring compliant usage, and improving user confidence in data, data catalogs improve user productivity, reduce time to insights, and reduce the risk associated with improper use of data or with using low-quality data.

Data Operations

Data flows within an enterprise are becoming increasingly complex, due to the increased use of SaaS tools to manage and process data as well as the growing number of initiatives for self-service ETL. Consequently, managing these flows can be expensive. Without the support of tools, manual management of these data flows requires a potentially large dedicated team of engineers. Furthermore, it becomes very challenging to avoid the risks associated with using low-quality data to support decisions and noncompliant use of sensitive data. Here are a number of related use cases that a data catalog can support.

Maintaining data delivery SLAs
It is common for data teams to provide datasets bound to defined SLAs in terms of frequency of updates and freshness. The freshness of a given dataset is a function of the freshness of all its upstream data sources. Accurate management of data SLAs is essential to obtaining user confidence in data and to support informed decision making. A data catalog’s support of automatic querying and visualization of the data flow helps assess the feasibility of SLAs and the impact of upstream changes and delays.

Handling data quality issues and incident response
When a particular dataset has a quality issue, the data operations team needs to understand the impact of the issue on all downstream data products. Once the issue is fixed, downstream data products need to be backfilled, usually in a specific order that respects their interdependency. Lineage data within a data catalog can be used to support, or possibly automate, such tasks.

Data deprecation
Similar to data quality issues, deprecating a dataset or a field of a dataset requires understanding the impact on downstream data products. This requires a data operations team to understand not only what is affected but who is affected and needs to be contacted.

In summary, a data catalog is an essential tool for data operations teams. A data catalog supports team efficiency and improves the quality of the data provided as a foundation for decision making.
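The backfill-ordering problem described above maps naturally onto a topological sort of the dependency graph. Below is a minimal sketch using Python’s standard library; the dependency graph is invented for illustration:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Illustrative dependencies: each dataset maps to the upstream datasets it is
# built from; a backfill must refresh upstreams before their downstreams.
depends_on = {
    "staging.orders": {"raw.orders"},
    "mart.revenue": {"staging.orders"},
    "mart.churn": {"staging.orders"},
    "dashboard.exec_kpis": {"mart.revenue"},
}

backfill_order = list(TopologicalSorter(depends_on).static_order())
print(backfill_order)
# one valid order: ['raw.orders', 'staging.orders', 'mart.revenue',
#                   'mart.churn', 'dashboard.exec_kpis']
```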

Cloud and Multicloud Migration

Cloud computing services provide enterprises many promising advantages, including cost reductions, instant scalability, and new innovations available only in the cloud. Organizations are increasingly migrating from on-premises resources to the cloud or adopting a hybrid model with only part of the infrastructure managed and owned by the organization itself. However, migrating to the cloud is a challenging task. The seamless scalability provided by the cloud and its pay-per-usage cost model can complicate migration initiatives and often incur unanticipated costs. During the migration, organizations find themselves in a state of flux as data is spread across on-premises and cloud environments.

Data catalogs can help organizations during the planning and execution of data migration to the cloud. A complete “lift-and-shift” strategy will bring all of the organization’s data to the cloud; data that is never used will still incur an unnecessary additional cost. When planning a data migration, data catalogs support identifying commonly used, relevant data assets that need to be moved to the cloud, reducing costs and optimizing the cloud data environment. Additionally, the data catalog helps IT prioritize and manage the migration by providing information about data dependency.

During data migration, analysts can no longer be sure where to go for the best, most appropriate data. Data catalogs signal to users when and where

data is being migrated from the old to new system. Therefore, data consumers can continue to find the data they need, regardless of where it resides during the migration journey. In summary, data catalogs support and accelerate business goals when migrating to the cloud. Data catalogs help reduce the cost of migrating to the cloud, increase productivity by providing transparency throughout the migration journey, and accelerate adoption, increasing the value of migrated data to the business.

Summary

This chapter described the potential business impact of data catalogs and discussed a number of use cases, which illustrate the potential of data catalogs as a business-critical investment. It is worth keeping in mind two things:

Remember the previous chapter’s recommended practice of having a clear use case to start with when implementing a data catalog and to continuously measure the achieved impact.

Be aware of the common pitfall of using a specialized catalog for a chosen use case without considering the enterprise’s future needs.

Chapter 5. Conclusion

In this report, we described what a modern data catalog is, how to think about incorporating it into your modern data tech stack, and the business value you should expect from it. We provided a framework to understand the characteristics of a data catalog by three aspects, as described in “A Framework to Characterize Data Catalogs”:

Broad connectivity
Data catalogs with broad connectivity have flexible and extensible data models. They capture metadata and represent not only data assets in an enterprise, but related entities, such as metrics, charts, AI features, and users. Catalogs with broad connectivity are designed to easily integrate with other systems in an enterprise. They expose their internal services via open and expressive APIs to allow for further extensibility.

Intelligence
Intelligence allows catalogs to go beyond capturing only explicit metadata. Intelligence enables catalogs to incorporate human knowledge, both passively (by tracking human usage and popularity of assets) and actively (by crowdsourcing tribal knowledge and incorporating users’ feedback). These catalogs employ advanced techniques, such as machine learning and NLP, to enrich collected metadata, extract links and relationships, and infer implicit and missing information.

Active data governance
Active governance guides users as they find and use data. A data catalog with active governance will surface compliance information about sensitive data at point of use, so as to encourage users to use canonical and high-quality data assets; it will also provide a way to ask domain experts for help. Such catalogs actively help users to ensure compliant usage of data with features such as masking, which anonymizes PII for given user

personas who are restricted from viewing it per the GDPR.

In Chapter 2, we described three types of catalogs: tool-adjunct, domain-specific, and data catalog platforms. The different types vary in their scope and approach. Enterprises do not need to exclusively choose one type or another. In fact, it is common for data catalog platforms to be a “catalog of catalogs” collecting metadata from different tools and other catalogs.

Chapter 3 described the role of a data catalog in an enterprise data stack, and presented lessons and recommendations for a successful implementation of a data catalog in an enterprise. Finally, in Chapter 4, we described the impact of data catalogs to make the business case for investing in an enterprise data catalog. Through integrating, interlinking, and enriching the various pieces of metadata, catalogs have evolved to play an essential role in a wide variety of use cases within enterprises. In fact, enterprise catalogs provide a data intelligence platform that addresses varied use cases from self-service analytics to governance and cloud data migration.

When embarking on a data catalog project within an enterprise, the first challenge is often to clearly identify the goals and primary use cases to be addressed. “Data catalog” is a frequently misunderstood term and can mean different things to different people. It helps to start with a specific use case and a set of data sources. Investing time to clarify the goals and align on the terminology is paramount and should not be overlooked.

Another typical challenge is the existence of multiple catalogs in an enterprise. As described in this report, cataloging metadata is an established practice and many tools have their own catalogs. The value an enterprise data catalog provides is in bringing all this metadata together into a central place. Siloed catalogs are essential for operations within their own tool or domain, but enabling data intelligence and achieving an open data strategy require a holistic approach to metadata. This is a task for which enterprise data catalogs are designed.

Choosing the right tool to build a catalog is another challenge. There is no

shortage of free and commercial options, but it’s important to select one with all the necessary features to address the given use case. On the other hand, a highly specialized data catalog that is hard to integrate with or extend can be a short-term solution that limits the ability to expand into future solutions.

Finally, the people and culture aspects of data cataloging are equally important. It is essential to have a plan for building a community around the catalog to ensure long-term success and sustainability. A key factor in the success of a data catalog is adoption. Encouraging people to use and contribute to it—so the broader community gets value from it—requires some behavior change and creativity. However, once people do experience the value, it will be difficult to go back.

Our hope is that this report brings some clarity and useful guidance for people considering, working on, or already advancing their efforts to build an enterprise data catalog.

About the Authors

Fadi Maali is a software engineer who has worked with data for the last twelve years. He has experience in building and operating cloud data platforms and in building data ingestion pipelines that support both batch and stream data in a robust, easy-to-operate, and well-governed manner. Fadi has a particular focus on data quality and data catalogs. Fadi led the implementation of an internal data catalog during his work at Zendesk. He is also the lead editor of the DCAT Specification, a W3C Recommendation for a vocabulary to describe government data catalogs. Fadi holds a PhD in Computer Science from the National University of Ireland Galway. His PhD thesis, submitted in 2016, focused on querying large graph data.

Jason Lim is the director of product and cloud marketing at Alation. Jason cofounded Koombah, a real-estate startup in China, and AsiaRecon, a technology and innovation tour in Australia. Jason was a contributing writer for Forbes Asia, covering startups and tech trends. Jason is originally from Sydney, Australia and now lives in California. You can contact the author via email ([email protected]), on Twitter, or LinkedIn.