Learning and Operating Presto: Fast, Reliable SQL for Data Analytics and Lakehouses 9781098141851

The Presto community has mushroomed since its origins at Facebook in 2012. But ramping up this open source distributed S

Language: English. Pages: 191. Year: 2023.

Table of contents:
Preface
Why We Wrote This Book
Who This Book Is For
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
Angelica Lo Duca
Tim Meehan
Vivek Bharathan
Ying Su
1. Introduction to Presto
Data Warehouses and Data Lakes
The Role of Presto in a Data Lake
Presto Origins and Design Considerations
High Performance
High Scalability
Compliance with the ANSI SQL Standard
Federation of Data Sources
Running in the Cloud
Presto Architecture and Core Components
Alternatives to Presto
Apache Impala
Apache Hive
Spark SQL
Trino
Presto Use Cases
Reporting and Dashboarding
Ad Hoc Querying
ETL Using SQL
Data Lakehouse
Real-Time Analytics with Real-Time Databases
Introducing Our Case Study
Conclusion
2. Getting Started with Presto
Presto Manual Installation
Running Presto on Docker
Installing Docker
Presto Docker Image
Dockerfile
The etc/ directory
node.properties
jvm.config
config.properties
log.properties
catalog/.properties
Building and Running Presto on Docker
The Presto Sandbox
Deploying Presto on Kubernetes
Introducing Kubernetes
Configuring Presto on Kubernetes
presto-coordinator.yaml
presto-workers.yaml
presto-config-map.yaml
presto-secrets.yaml
Adding a New Catalog
Running the Deployment on Kubernetes
Querying Your Presto Instance
Listing Catalogs
Listing Schemas
Listing Tables
Querying a Table
Conclusion
3. Connectors
Service Provider Interface
Connector Architecture
Popular Connectors
Thrift
Writing a Custom Connector
Prerequisites
Plugin and Module
ExamplePlugin
ExampleConnectorFactory
ExampleModule
ExampleConnector
ExampleHandleResolver
Configuration
ExampleConfig
SessionProperties
TableProperties
Metadata
Data model
Handles
ExampleMetadata
ExampleClient
Input/Output
ExampleSplitManager
ExampleSplit
ExampleRecordSetProvider and ExampleRecordSet
ExampleRecordCursor
Deploying Your Connector
Apache Pinot
Setting Up and Configuring Presto
Setting up Pinot
Configuring Pinot
Configuring Presto with Pinot
Presto-Pinot Querying in Action
Conclusion
4. Client Connectivity
Setting Up the Environment
Presto Client
Docker Image
Kubernetes Node
Connectivity to Presto
REST API
Python
R
JDBC
Node.js
ODBC
Other Presto Client Libraries
Building a Client Dashboard in Python
Setting Up the Client
Building the Dashboard
Connecting to and querying Presto
Preparing the results of the query
Building the first graph
Building the second graph
Conclusion
5. Open Data Lakehouse Analytics
The Emergence of the Lakehouse
Data Lakehouse Architecture
Data Lake
File Store
File Format
Table Format
Query Engine
Metadata Management
Data Governance
Data Access Control
Building a Data Lakehouse
Configuring MinIO
Populating MinIO
Configuring HMS
Configuring Spark
Registering Hudi Tables with HMS
Connecting and Querying Presto
Conclusion
6. Presto Administration
Introducing Presto Administration
Configuration
Properties
How to configure a cluster
Sessions
Using sessions
JVM
Memory
Out-of-memory errors
Garbage collection
Monitoring
Console
Using the console for monitoring
Using the console for debugging
Using the console for going over the interactive plan
REST API
Metrics
JMX connector
REST API
JMX exporters
Management
Resource Groups
Configuring resource groups
Resource groups properties
Example
Verifiers
Setting up the system
Configuring the MySQL database
Configuring the Presto verifier
Running a test
Session Properties Managers
Configuring a session property manager
Namespace Functions
Setting up the system
Configuring a function
Running a test
Conclusion
7. Understanding Security in Presto
Introducing Presto Security
Building Secure Communication in Presto
Encryption
Keystore Management
Configuring HTTPS/TLS
Running a Presto client
Running the Presto console
Authentication
File-Based Authentication
Running a Presto client
Running the Presto console
LDAP
Kerberos
Prerequisites
Configuring the Presto coordinator and workers
Configuring the Presto client
Creating a Custom Authenticator
Authorization
Authorizing Access to the Presto REST API
Configuring System Access Control
Authorization Through Apache Ranger
Building a custom audit function
Conclusion
8. Performance Tuning
Introducing Performance Tuning
Reasons for Performance Tuning
The Performance Tuning Life Cycle
Query Execution Model
Approaches for Performance Tuning in Presto
Resource Allocation
Storage
Query Optimization
Aria Scan
Table Scanning
Repartitioning
Implementing Performance Tuning
Building and Importing the Sample CSV Table in MinIO
Converting the CSV Table in ORC
Defining the Tuning Parameters
Running Tests
Default parameters
Reducing CPU usage
Query optimization
Aria scan
Conclusion
9. Operating Presto at Scale
Introducing Scalability
Reasons to Scale Presto
Common Issues
Design Considerations
Availability
Manageability
Performance
Protection
Configuration
How to Scale Presto
Multiple Coordinators
Presto on Spark
Spilling
Using a Cloud Service
Conclusion
Index