Building ETL Pipelines with Python: Create and deploy enterprise-ready ETL pipelines by employing modern methods [1 ed.] 9781804615256

Develop production-ready ETL pipelines by leveraging Python libraries and deploying them for suitable use cases.


English · 578 pages · 2023

Table of contents:
Building ETL Pipelines with Python
Contributors
About the authors
About the reviewers
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Conventions used
Get in touch
Share Your Thoughts
Download a free PDF copy of this book
Part 1: Introduction to ETL, Data Pipelines, and Design Principles
Chapter 1: A Primer on Python and the Development Environment
Introducing Python fundamentals
An overview of Python data structures
Python if…else conditional statements
Python looping techniques
Python functions
Object-oriented programming with Python
Working with files in Python
Establishing a development environment
Version control with Git tracking
Documenting environment dependencies with requirements.txt
Utilizing module management systems (MMSs)
Configuring a Pipenv environment in PyCharm
Summary
Chapter 2: Understanding the ETL Process and Data Pipelines
What is a data pipeline?
How do we create a robust pipeline?
Pre-work – understanding your data
Design planning – planning your workflow
Architecture development – developing your resources
Putting it all together – project diagrams
What is an ETL data pipeline?
Batch processing
Streaming method
Cloud-native
Automating ETL pipelines
Exploring use cases for ETL pipelines
Summary
References
Chapter 3: Design Principles for Creating Scalable and Resilient Pipelines
Technical requirements
Understanding the design patterns for ETL
Basic ETL design pattern
ETL-P design pattern
ETL-VP design pattern
ELT two-phase pattern
Preparing your local environment for installations
Open source Python libraries for ETL pipelines
Pandas
NumPy
Scaling for big data packages
Dask
Numba
Summary
References
Part 2: Designing ETL Pipelines with Python
Chapter 4: Sourcing Insightful Data and Data Extraction Strategies
Technical requirements
What is data sourcing?
Accessibility to data
Types of data sources
Getting started with data extraction
CSV and Excel data files
Parquet data files
API connections
Databases
Data from web pages
Creating a data extraction pipeline using Python
Data extraction
Logging
Summary
References
Chapter 5: Data Cleansing and Transformation
Technical requirements
Scrubbing your data
Data transformation
Data cleansing and transformation in ETL pipelines
Understanding the downstream applications of your data
Strategies for data cleansing and transformation in Python
Preliminary tasks – the importance of staging data
Transformation activities in Python
Creating data pipeline activity in Python
Summary
Chapter 6: Loading Transformed Data
Technical requirements
Introduction to data loading
Choosing the load destination
Types of load destinations
Best practices for data loading
Optimizing data loading activities by controlling the data import method
Creating demo data
Full data loads
Incremental data loads
Precautions to consider
Tutorial – preparing your local environment for data loading activities
Downloading and installing PostgreSQL
Creating data schemas in PostgreSQL
Summary
Chapter 7: Tutorial – Building an End-to-End ETL Pipeline in Python
Technical requirements
Introducing the project
The approach
The data
Creating tables in PostgreSQL
Sourcing and extracting the data
Transformation and data cleansing
Loading data into PostgreSQL tables
Making it deployable
Summary
Chapter 8: Powerful ETL Libraries and Tools in Python
Technical requirements
Architecture of Python files
Configuring your local environment
config.ini
config.yaml
Part 1 – ETL tools in Python
Bonobo
Odo
Mito ETL
Riko
pETL
Luigi
Part 2 – pipeline workflow management platforms in Python
Airflow
Summary
Part 3: Creating ETL Pipelines in AWS
Chapter 9: A Primer on AWS Tools for ETL Processes
Common data storage tools in AWS
Amazon RDS
Amazon Redshift
Amazon S3
Amazon EC2
Discussion – Building flexible applications in AWS
Leveraging S3 and EC2
Computing and automation with AWS
AWS Glue
AWS Lambda
AWS Step Functions
AWS big data tools for ETL pipelines
AWS Data Pipeline
Amazon Kinesis
Amazon EMR
Walk-through – creating a Free Tier AWS account
Prerequisites for running AWS from your device
AWS CLI
Docker
LocalStack
AWS SAM CLI
Summary
Chapter 10: Tutorial – Creating an ETL Pipeline in AWS
Technical requirements
Creating a Python pipeline with Amazon S3, Lambda, and Step Functions
Setting the stage with the AWS CLI
Creating a “proof of concept” data pipeline in Python
Using Boto3 and Amazon S3 to read data
AWS Lambda functions
AWS Step Functions
An introduction to a scalable ETL pipeline using Bonobo, EC2, and RDS
Configuring your AWS environment with EC2 and RDS
Creating an RDS instance
Creating an EC2 instance
Creating a data pipeline locally with Bonobo
Adding the pipeline to AWS
Summary
Chapter 11: Building Robust Deployment Pipelines in AWS
Technical requirements
What is CI/CD and why is it important?
The six key elements of CI/CD
Essential steps for CI/CD adoption
CI/CD is a continual process
Creating a robust CI/CD process for ETL pipelines in AWS
Creating a CI/CD pipeline
Building an ETL pipeline using various AWS services
Setting up a CodeCommit repository
Orchestrating with AWS CodePipeline
Testing the pipeline
Summary
Part 4: Automating and Scaling ETL Pipelines
Chapter 12: Orchestration and Scaling in ETL Pipelines
Technical requirements
Performance bottlenecks
Inflexibility
Limited scalability
Operational overheads
Exploring the types of scaling
Vertical scaling
Horizontal scaling
Choose your scaling strategy
Processing requirements
Data volume
Cost
Complexity and skills
Reliability and availability
Data pipeline orchestration
Task scheduling
Error handling and recovery
Resource management
Monitoring and logging
Putting it together with a practical example
Summary
Chapter 13: Testing Strategies for ETL Pipelines
Technical requirements
Benefits of testing data pipeline code
How to choose the right testing strategies for your ETL pipeline
How often should you test your ETL pipeline?
Creating tests for a simple ETL pipeline
Unit testing
Validation testing
Integration testing
End-to-end testing
Performance testing
Resilience testing
Best practices for a testing environment for ETL pipelines
Defining testing objectives
Establishing a testing framework
Automating ETL tests
Monitoring ETL pipelines
ETL testing challenges
Data privacy and security
Environment parity
Top ETL testing tools
Summary
Chapter 14: Best Practices for ETL Pipelines
Technical requirements
Data quality
Poor scalability
Lack of error-handling and recovery methods
ETL logging in Python
Debugging and issue resolution
Auditing and compliance
Performance monitoring
Including contextual information
Handling exceptions and errors
The Goldilocks principle
Implementing logging in Python
Checkpoint for recovery
Avoiding SPOFs
Modularity and auditing
Modularity
Auditing
Summary
Chapter 15: Use Cases and Further Reading
Technical requirements
New York Yellow Taxi data, ETL pipeline, and deployment
Step 1 – configuration
Step 2 – ETL pipeline script
Step 3 – unit tests
Building a robust ETL pipeline with US construction data in AWS
Prerequisites
Step 1 – data extraction
Step 2 – data transformation
Step 3 – data loading
Running the ETL pipeline
Bonus – deploying your ETL pipeline
Summary
Further reading
Index
Why subscribe?
Other Books You May Enjoy
Packt is searching for authors like you
Share Your Thoughts
Download a free PDF copy of this book
