Elasticsearch in Action, Second Edition [2 ed.] 9781617299858

Build powerful, production-ready search applications using the incredible features of Elasticsearch. In Elasticsearch i

English Pages 593 Year 2023

Table of contents :
inside front cover
Praise for the First Edition
Elasticsearch in Action
Copyright
dedication
contents
front matter
foreword
preface
acknowledgments
about this book
Who should read this book
How this book is organized: A road map
About the code
liveBook discussion forum
about the author
about the cover illustration
1 Overview
1.1 What makes a good search engine?
1.2 Search is the new normal
1.2.1 Structured vs. unstructured (full-text) data
1.2.2 Search supported by a database
1.2.3 Databases vs. search engines
1.3 Modern search engines
1.3.1 Functionality
1.3.2 Popular search engines
1.4 Elasticsearch overview
1.4.1 Core areas
1.4.2 Elastic Stack
1.4.3 Elasticsearch use cases
1.4.4 Unsuitable Elasticsearch uses
1.4.5 Misconceptions
1.5 Popular adoption
1.6 Generative AI and modern search
Summary
2 Getting started
2.1 Priming Elasticsearch with data
2.1.1 An online bookstore
2.1.2 Indexing documents
2.1.3 Indexing our first document
2.1.4 Indexing more documents
2.2 Retrieving data
2.2.1 Counting documents
2.2.2 Retrieving documents
2.3 Full-text search
2.3.1 Match query: Books by an author
2.3.2 Match query with the AND operator
2.3.3 Indexing documents using the _bulk API
2.3.4 Searching across multiple fields
2.3.5 Boosting results
2.3.6 Search phrases
2.3.7 Phrases with missing words
2.3.8 Handling spelling mistakes
2.4 Term-level queries
2.4.1 The term query
2.4.2 The range query
2.5 Compound queries
2.5.1 Boolean (bool) query
2.5.2 The must clause
2.5.3 The must_not clause
2.5.4 The should clause
2.5.5 The filter clause
2.6 Aggregations
2.6.1 Metrics
2.6.2 Bucket aggregations
Summary
3 Architecture
3.1 A high-level overview
3.1.1 Data in
3.1.2 Processing data
3.1.3 Data out
3.2 The building blocks
3.2.1 Documents
3.2.2 Indexes
3.2.3 Data streams
3.2.4 Shards and replicas
3.2.5 Nodes and clusters
3.3 Inverted indexes
3.4 Relevancy
3.4.1 Relevancy scores
3.4.2 Relevancy (similarity) algorithms
3.5 Routing algorithm
3.6 Scaling
3.6.1 Scaling up (vertical scaling)
3.6.2 Scaling out (horizontal scaling)
Summary
4 Mapping
4.1 Overview of mapping
4.1.1 Mapping definition
4.1.2 Indexing a document for the first time
4.2 Dynamic mapping
4.2.1 The mechanism for deducing types
4.2.2 Limitations of dynamic mapping
4.3 Explicit mapping
4.3.1 Mapping using the indexing API
4.3.2 Updating schema using the mapping API
4.3.3 Modifying existing fields is not allowed
4.3.4 Type coercion
4.4 Data types
4.5 Core data types
4.5.1 The text data type
4.5.2 The keyword data types
4.5.3 The date data type
4.5.4 Numeric data types
4.5.5 The boolean data type
4.5.6 The range data types
4.5.7 The IP address (ip) data type
4.6 Advanced data types
4.6.1 The geo_point data type
4.6.2 The object data type
4.6.3 The nested data type
4.6.4 The flattened data type
4.6.5 The join data type
4.6.6 The search_as_you_type data type
4.7 Fields with multiple data types
Summary
5 Working with documents
5.1 Indexing documents
5.1.1 Document APIs
5.1.2 Mechanics of indexing
5.1.3 Customizing the refresh process
5.2 Retrieving documents
5.2.1 Using the single-document API
5.2.2 Retrieving multiple documents
5.2.3 The ids query
5.3 Manipulating responses
5.3.1 Removing metadata from the response
5.3.2 Suppressing the source document
5.3.3 Including and excluding fields
5.4 Updating documents
5.4.1 Document update mechanics
5.4.2 The _update API
5.4.3 Scripted updates
5.4.4 Replacing documents
5.4.5 Upserts
5.4.6 Updates as upserts
5.4.7 Updating with a query
5.5 Deleting documents
5.5.1 Deleting with an ID
5.5.2 Deleting by query (_delete_by_query)
5.5.3 Deleting with a range query
5.5.4 Deleting all documents
5.6 Working with documents in bulk
5.6.1 Format of the _bulk API
5.6.2 Bulk indexing documents
5.6.3 Independent entities and multiple actions
5.6.4 Bulk requests using cURL
5.7 Reindexing documents
Summary
6 Indexing operations
6.1 Indexing operations
6.2 Creating indexes
6.2.1 Creating indexes implicitly (automatic creation)
6.2.2 Creating indexes explicitly
6.2.3 Indexes with custom settings
6.2.4 Indexes with mappings
6.2.5 Index with aliases
6.3 Reading indexes
6.3.1 Reading public indexes
6.3.2 Reading hidden indexes
6.4 Deleting indexes
6.5 Closing and opening indexes
6.5.1 Closing indexes
6.5.2 Opening indexes
6.6 Index templates
6.6.1 Creating composable (index) templates
6.6.2 Creating component templates
6.7 Monitoring and managing indexes
6.7.1 Index statistics
6.7.2 Multiple indexes and statistics
6.8 Advanced operations
6.8.1 Splitting an index
6.8.2 Shrinking an index
6.8.3 Rolling over an index alias
6.9 Index lifecycle management (ILM)
6.9.1 Index lifecycle
6.9.2 Managing the index lifecycle manually
6.9.3 Lifecycle with rollover
Summary
7 Text analysis
7.1 Overview
7.1.1 Querying unstructured data
7.1.2 Analyzers to the rescue
7.2 Analyzer modules
7.2.1 Tokenization
7.2.2 Normalization
7.2.3 Anatomy of an analyzer
7.2.4 Testing analyzers
7.3 Built-in analyzers
7.3.1 The standard analyzer
7.3.2 The simple analyzer
7.3.3 The whitespace analyzer
7.3.4 The keyword analyzer
7.3.5 The fingerprint analyzer
7.3.6 The pattern analyzer
7.3.7 Language analyzers
7.4 Custom analyzers
7.4.1 Advanced customization
7.5 Specifying analyzers
7.5.1 Analyzers for indexing
7.5.2 Analyzers for searching
7.6 Character filters
7.6.1 HTML strip (html_strip) filter
7.6.2 The mapping character filter
7.6.3 Mappings via a file
7.6.4 The pattern_replace character filter
7.7 Tokenizers
7.7.1 The standard tokenizer
7.7.2 The ngram and edge_ngram tokenizers
7.7.3 Other tokenizers
7.8 Token filters
7.8.1 The stemmer filter
7.8.2 The shingle filter
7.8.3 The synonym filter
Summary
8 Introducing search
8.1 Overview
8.2 How does search work?
8.3 Movie sample data
8.4 Search fundamentals
8.4.1 The _search endpoint
8.4.2 Query vs. filter context
8.5 Anatomy of a request and a response
8.5.1 Search requests
8.5.2 Search responses
8.6 URI request searches
8.6.1 Searching for movies by title
8.6.2 Searching for a specific movie
8.6.3 Additional parameters
8.6.4 Supporting URI requests with Query DSL
8.7 Query DSL
8.7.1 Sample query
8.7.2 Query DSL for cURL
8.7.3 Query DSL for aggregations
8.7.4 Leaf and compound queries
8.8 Search features
8.8.1 Pagination
8.8.2 Highlighting
8.8.3 Explaining relevancy scores
8.8.4 Sorting
8.8.5 Manipulating results
8.8.6 Searching across indexes and data streams
Summary
9 Term-level search
9.1 Overview of term-level search
9.1.1 Term-level queries are not analyzed
9.1.2 Term-level query example
9.2 The term query
9.2.1 The term query on text fields
9.2.2 Example term query
9.2.3 Shortened term-level queries
9.3 The terms query
9.3.1 Example terms query
9.3.2 The terms lookup query
9.4 The ids query
9.5 The exists query
9.6 The range query
9.7 The wildcard query
9.8 The prefix query
9.8.1 Shortened queries
9.8.2 Speeding up prefix queries
9.9 Fuzzy queries
Summary
10 Full-text searches
10.1 Overview
10.1.1 Precision
10.1.2 Recall
10.2 Sample data
10.3 The match_all query
10.3.1 Building the match_all query
10.3.2 Short form of a match_all query
10.4 The match_none query
10.5 The match query
10.5.1 Format of a match query
10.5.2 Searching using a match query
10.5.3 Analyzing match queries
10.5.4 Searching for multiple words
10.5.5 Matching at least a few words
10.5.6 Fixing typos using the fuzziness keyword
10.6 The match_phrase query
10.7 The match_phrase_prefix query
10.8 The multi_match query
10.8.1 Best fields
10.8.2 The dis_max query
10.8.3 Tiebreakers
10.8.4 Boosting individual fields
10.9 The query_string query
10.9.1 Fields in a query_string query
10.9.2 Default operators
10.9.3 The query_string query with a phrase
10.10 Fuzzy queries
10.11 Simple string queries
10.12 The simple_query_string query
Summary
11 Compound queries
11.1 Sample product data
11.1.1 The products schema
11.1.2 Indexing products
11.2 Compound queries
11.3 The Boolean (bool) query
11.3.1 The bool query structure
11.3.2 The must clause
11.3.3 Enhancing the must clause
11.3.4 The must_not clause
11.3.5 Enhancing the must_not clause
11.3.6 The should clause
11.3.7 The filter clause
11.3.8 Combining all the clauses
11.3.9 Named queries
11.4 Constant scores
11.5 The boosting query
11.6 The disjunction max (dis_max) query
11.7 The function_score query
11.7.1 The random_score function
11.7.2 The script_score function
11.7.3 The field_value_factor function
11.7.4 Combining function scores
Summary
12 Advanced search
12.1 Introducing location search
12.1.1 The bounding_box query
12.1.2 The geo_distance query
12.1.3 The geo_shape query
12.2 Geospatial data types
12.2.1 The geo_point data type
12.2.2 The geo_shape data type
12.3 Geospatial queries
12.4 The geo_bounding_box query
12.5 The geo_distance query
12.6 The geo_shape query
12.7 The shape query
12.8 The span query
12.8.1 Sample data
12.8.2 The span_first query
12.8.3 The span_near query
12.8.4 The span_within query
12.8.5 The span_or query
12.9 Specialized queries
12.9.1 The distance_feature query
12.9.2 The pinned query
12.9.3 The more_like_this query
12.9.4 The percolate query
Summary
13 Aggregations
13.1 Overview
13.1.1 The endpoint and syntax
13.1.2 Combining searches and aggregations
13.1.3 Multiple and nested aggregations
13.1.4 Ignoring results
13.2 Metric aggregations
13.2.1 Sample data
13.2.2 The value_count metric
13.2.3 The avg metric
13.2.4 The sum metric
13.2.5 The min and max metrics
13.2.6 The stats metric
13.2.7 The extended_stats metric
13.2.8 The cardinality metric
13.3 Bucket aggregations
13.3.1 Histograms
13.3.2 Child-level aggregations
13.3.3 Custom range aggregations
13.3.4 The terms aggregation
13.3.5 The multi-terms aggregation
13.4 Parent and sibling aggregations
13.4.1 Parent aggregations
13.4.2 Sibling aggregations
13.5 Pipeline aggregations
13.5.1 Pipeline aggregation types
13.5.2 Sample data
13.5.3 Syntax for pipeline aggregations
13.5.4 Available pipeline aggregations
13.5.5 The cumulative_sum parent aggregation
13.5.6 The max_bucket and min_bucket sibling pipeline aggregations
Summary
14 Administration
14.1 Scaling the cluster
14.1.1 Adding nodes to the cluster
14.1.2 Cluster health
14.1.3 Increasing read throughput
14.2 Node communication
14.3 Shard sizing
14.3.1 Setting up a single index
14.3.2 Setting up multiple indexes
14.4 Snapshots
14.4.1 Getting started
14.4.2 Registering a snapshot repository
14.4.3 Creating snapshots
14.4.4 Restoring snapshots
14.4.5 Deleting snapshots
14.4.6 Automating snapshots
14.5 Advanced configurations
14.5.1 The main configuration file
14.5.2 Logging options
14.5.3 Java virtual machine options
14.6 Cluster masters
14.6.1 Master nodes
14.6.2 Master elections
14.6.3 Cluster state
14.6.4 A quorum
14.6.5 The split-brain problem
14.6.6 Dedicated master nodes
Summary
15 Performance and troubleshooting
15.1 Search and speed problems
15.1.1 Modern hardware
15.1.2 Document modeling
15.1.3 Choosing keyword types over text types
15.2 Index speed problems
15.2.1 System-generated identifiers
15.2.2 Bulk requests
15.2.3 Adjusting the refresh rate
15.3 Unstable clusters
15.3.1 Cluster is not GREEN
15.3.2 Unassigned shards
15.3.3 Disk-usage thresholds
15.4 Circuit breakers
15.5 Final words
Summary
Appendix A. Installation
A.1 Installing Elasticsearch
A.1.1 Downloading the Elasticsearch binary
A.1.2 Starting up on Windows
A.1.3 Starting up on macOS
A.1.4 Installing via Docker
A.1.5 Testing the server with the _cat API
A.2 Installing Kibana
A.2.1 Downloading the Kibana binary
A.2.2 Kibana on Windows
A.2.3 Kibana on macOS
A.2.4 Installing via Docker
Appendix B. Ingest pipelines
B.1 Overview
B.2 Mechanics of ingest pipelines
B.3 Loading PDFs into Elasticsearch
Appendix C. Clients
C.1 Java client
C.2 Background
C.3 Maven/Gradle project setup
C.4 Initialization
C.5 Namespace clients
C.6 Creating an index
C.7 Indexing documents
C.8 Searching
index
