Modern Data Architectures with Python: A practical guide to building and deploying data pipelines, data warehouses & data lakes 9781801070492

Build scalable and reliable data ecosystems using Data Mesh, Databricks Spark, and Kafka. Key Features: Develop modern da

English Pages 318 Year 2023

Table of contents:
Modern Data Architectures with Python
Contributors
About the author
About the reviewers
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Conventions used
Get in touch
Share Your Thoughts
Download a free PDF copy of this book
Part 1: Fundamental Data Knowledge
1
Modern Data Processing Architecture
Technical requirements
Databases, data warehouses, and data lakes
OLTP
OLAP
Data lakes
Event stores
File formats
Data platform architecture at a high level
Comparing the Lambda and Kappa architectures
Lambda architecture
Kappa architecture
Lakehouse and Delta architectures
Lakehouses
The seven central tenets
The medallion data pattern and the Delta architecture
Data mesh theory and practice
Defining terms
The four principles of data mesh
Summary
Practical lab
Solution
2
Understanding Data Analytics
Technical requirements
Setting up your environment
Python
venv
Graphviz
Workflow initialization
Cleaning and preparing your data
Duplicate values
Working with nulls
Using RegEx
Outlier identification
Casting columns
Fixing column names
Complex data types
Data documentation
diagrams
Data lineage graphs
Data modeling patterns
Relational
Dimensional modeling
Key terms
OBT
Practical lab
Loading the problem data
Solution
Summary
Part 2: Data Engineering Toolset
3
Apache Spark Deep Dive
Technical requirements
Setting up your environment
Python, AWS, and Databricks
Databricks CLI
Cloud data storage
Object storage
Relational
NoSQL
Spark architecture
Introduction to Apache Spark
Key components
Working with partitions
Shuffling partitions
Caching
Broadcasting
Job creation pipeline
Delta Lake
Transaction log
Grouping tables with databases
Table
Adding speed with Z-ordering
Bloom filters
Practical lab
Problem 1
Problem 2
Problem 3
Solution
Summary
4
Batch and Stream Data Processing Using PySpark
Technical requirements
Setting up your environment
Python, AWS, and Databricks
Databricks CLI
Batch processing
Partitioning
Data skew
Reading data
Spark schemas
Making decisions
Removing unwanted columns
Working with data in groups
The UDF
Stream processing
Reading from disk
Debugging
Writing to disk
Batch stream hybrid
Delta streaming
Batch processing in a stream
Practical lab
Setup
Creating fake data
Problem 1
Problem 2
Problem 3
Solution
Solution 1
Solution 2
Solution 3
Summary
5
Streaming Data with Kafka
Technical requirements
Setting up your environment
Python, AWS, and Databricks
Databricks CLI
Confluent Kafka
Signing up
Kafka architecture
Topics
Partitions
Brokers
Producers
Consumers
Schema Registry
Kafka Connect
Spark and Kafka
Practical lab
Solution
Summary
Part 3: Modernizing the Data Platform
6
MLOps
Technical requirements
Setting up your environment
Python, AWS, and Databricks
Databricks CLI
Introduction to machine learning
Understanding data
The basics of feature engineering
Splitting up your data
Fitting your data
Cross-validation
Understanding hyperparameters and parameters
Training our model
Working together
AutoML
MLflow
MLOps benefits
Feature stores
Hyperopt
Practical lab
Create an MLflow project
Summary
7
Data and Information Visualization
Technical requirements
Setting up your environment
Principles of data visualization
Understanding your user
Validating your data
Data visualization using notebooks
Line charts
Bar charts
Histograms
Scatter plots
Pie charts
Bubble charts
A single line chart
A multiple line chart
A bar chart
A scatter plot
A histogram
A bubble chart
GUI data visualizations
Tips and tricks with Databricks notebooks
Magic
Markdown
Other languages
Terminal
Filesystem
Running other notebooks
Widgets
Databricks SQL analytics
Accessing SQL analytics
SQL Warehouses
SQL editors
Queries
Dashboards
Alerts
Query history
Connecting BI tools
Practical lab
Loading problem data
Problem 1
Solution
Problem 2
Solution
Summary
8
Integrating Continuous Integration into Your Workflow
Technical requirements
Setting up your environment
Databricks
Databricks CLI
The DBX CLI
Docker
Git
GitHub
Pre-commit
Terraform
Docker
Install Jenkins, container setup, and compose
CI tooling
Git and GitHub
Pre-commit
Python wheels and packages
Anatomy of a package
DBX
Important commands
Testing code
Terraform – IaC
IaC
The CLI
HCL
Jenkins
Jenkinsfile
Practical lab
Problem 1
Problem 2
Summary
9
Orchestrating Your Data Workflows
Technical requirements
Setting up your environment
Databricks
Databricks CLI
The DBX CLI
Orchestrating data workloads
Making life easier with Autoloader
Reading
Writing
Two modes
Useful options
Databricks Workflows
Terraform
Failed runs
REST APIs
The Databricks API
Python code
Logging
Practical lab
Solution
Lambda code
Notebook code
Summary
Part 4: Hands-on Project
10
Data Governance
Technical requirements
Setting up your environment
Python, AWS, and Databricks
The Databricks CLI
What is data governance?
Data standards
Data catalogs
Data lineage
Data security and privacy
Data quality
Great Expectations
Creating test data
Data context
Data source
Batch request
Validator
Adding tests
Saving the suite
Creating a checkpoint
Datadocs
Testing new data
Profiler
Databricks Unity
Practical lab
Summary
11
Building out the Groundwork
Technical requirements
Setting up your environment
The Databricks CLI
Git
GitHub
pre-commit
Terraform
PyPI
Creating GitHub repos
Terraform setup
Initial file setup
Schema repository
Schema repository
ML repository
Infrastructure repository
Summary
12
Completing Our Project
Technical requirements
Documentation
Schema diagram
C4 System Context diagram
Faking data with Mockaroo
Managing our schemas with code
Building our data pipeline application
Creating our machine learning application
Displaying our data with dashboards
Summary
Index
Why subscribe?
Other Books You May Enjoy
Packt is searching for authors like you
Share Your Thoughts
Download a free PDF copy of this book
