Python Web Scraping [2nd ed] 9781786464293, 1786464292

Chapter 5: Dynamic Content; An example dynamic web page; Reverse engineering a dynamic web page; Edge cases; Rendering

English Pages 215 Year 2017

Table of contents:
Cover
Credits
Copyright
About the Authors
About the Reviewers
www.PacktPub.com
Customer Feedback
Table of Contents
Preface
Chapter 1: Introduction to Web Scraping
When is web scraping useful?
Is web scraping legal?
Python 3
Background research
Checking robots.txt
Examining the Sitemap
Estimating the size of a website
Identifying the technology used by a website
Finding the owner of a website
Crawling your first website
Scraping versus crawling
Downloading a web page
Retrying downloads
Setting a user agent
Sitemap crawler
ID iteration crawler
Link crawlers
Advanced features
Parsing robots.txt
Supporting proxies
Throttling downloads
Avoiding spider traps
Final version
Using the requests library
Summary
Chapter 2: Scraping the Data
Analyzing a web page
Three approaches to scrape a web page
Regular expressions
Beautiful Soup
Lxml
CSS selectors and your Browser Console
XPath Selectors
LXML and Family Trees
Comparing performance
Scraping results
Overview of Scraping
Adding a scrape callback to the link crawler
Summary
Chapter 3: Caching Downloads
When to use caching?
Adding cache support to the link crawler
Disk Cache
Implementing DiskCache
Testing the cache
Saving disk space
Expiring stale data
Drawbacks of DiskCache
Key-value storage cache
What is key-value storage?
Installing Redis
Overview of Redis
Redis cache implementation
Compression
Testing the cache
Exploring requests-cache
Summary
Chapter 4: Concurrent Downloading
One million web pages
Parsing the Alexa list
Sequential crawler
Threaded crawler
How threads and processes work
Implementing a multithreaded crawler
Multiprocessing crawler
Performance
Python multiprocessing and the GIL
Optical character recognition
Further improvements
Solving complex CAPTCHAs
Using a CAPTCHA solving service
Getting started with 9kw
The 9kw CAPTCHA API
Reporting errors
Integrating with registration
CAPTCHAs and machine learning
Summary
Chapter 8: Scrapy
Installing Scrapy
Starting a project
Defining a model
Creating a spider
Tuning settings
Testing the spider
Different Spider Types
Scraping with the shell command
Checking results
Interrupting and resuming a crawl
Scrapy Performance Tuning
Visual scraping with Portia
Installation
Annotation
Running the Spider
