Web Scraping with Python - Sample Chapter
Scrape data from any website with the power of Python
Richard Lawson
Preface
The Internet contains the most useful set of data ever assembled, which is largely
publicly accessible for free. However, this data is not easily reusable. It is embedded
within the structure and style of websites and needs to be extracted to be useful.
This process of extracting data from web pages is known as web scraping and is
becoming increasingly useful as ever more information is available online.
Chapter 8, Scrapy, teaches you how to use the popular high-level Scrapy framework.
Chapter 9, Overview, is an overview of web scraping techniques that have been covered.
Background research
Before diving into crawling a website, we should develop an understanding of the scale and structure of our target website. The website itself can help us through its robots.txt and Sitemap files, and there are also external tools available to provide further details, such as Google Search and WHOIS.
Checking robots.txt
Most websites define a robots.txt file to let crawlers know of any restrictions on crawling their website. These restrictions are only a suggestion, but good web citizens will follow them. The robots.txt file is a valuable resource to check before crawling to minimize the chance of being blocked, and also to discover hints about a website's structure. More information about the robots.txt protocol is available at http://www.robotstxt.org. The following code is the content of our example robots.txt, which is available at http://example.webscraping.com/robots.txt:
# section 1
User-agent: BadCrawler
Disallow: /
# section 2
User-agent: *
Crawl-delay: 5
Disallow: /trap
# section 3
Sitemap: http://example.webscraping.com/sitemap.xml
In section 1, the robots.txt file asks a crawler with user agent BadCrawler not to
crawl their website, but this is unlikely to help because a malicious crawler would
not respect robots.txt anyway. A later example in this chapter will show you how
to make your crawler follow robots.txt automatically.
Section 2 specifies a crawl delay of 5 seconds between download requests for all
User-Agents, which should be respected to avoid overloading their server. There is
also a /trap link to try to block malicious crawlers who follow disallowed links. If you
visit this link, the server will block your IP for one minute! A real website would block
your IP for much longer, perhaps permanently, but then we could not continue with
this example.
Section 3 defines a Sitemap file, which will be examined in the next section.
Sitemap files are provided by websites to help crawlers locate their updated content without needing to crawl every web page. For further details, the sitemap standard is defined at http://www.sitemaps.org/protocol.html. Here is the content of the Sitemap file discovered in the robots.txt file:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.webscraping.com/view/Afghanistan-1</loc></url>
  <url><loc>http://example.webscraping.com/view/Aland-Islands-2</loc></url>
  <url><loc>http://example.webscraping.com/view/Albania-3</loc></url>
  ...
</urlset>
This sitemap provides links to all the web pages, which will be used in the next
section to build our first crawler. Sitemap files provide an efficient way to crawl a
website, but need to be treated carefully because they are often missing, out of date,
or incomplete.
Searching Google for site:example.webscraping.com returns the site search results for our example website. Google currently estimates 202 web pages for this site, which is about as expected. For larger websites, I have found Google's estimates to be less accurate.
We can filter these results to certain parts of the website by adding a URL path to the domain. The results for site:example.webscraping.com/view restrict the site search to the country web pages.
This additional filter is useful because ideally you will only want to crawl the part of
a website containing useful data rather than every page of it.
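Another useful piece of background research is identifying the technology used to build a website, which can be done with the builtwith module. Assuming the usual pip workflow, it can typically be installed with:

pip install builtwith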
This module will take a URL, download and analyze it, and then return the
technologies used by the website. Here is an example:
>>> import builtwith
>>> builtwith.parse('http://example.webscraping.com')
{u'javascript-frameworks': [u'jQuery', u'Modernizr', u'jQuery UI'],
u'programming-languages': [u'Python'],
u'web-frameworks': [u'Web2py', u'Twitter Bootstrap'],
u'web-servers': [u'Nginx']}
We can see here that the example website uses the Web2py Python web framework alongside some common JavaScript libraries, so its content is likely embedded in the HTML and should be relatively straightforward to scrape. If the website were instead built with AngularJS, then its content would likely be loaded dynamically. Or, if the website used ASP.NET, then it would be necessary to use sessions and form submissions to crawl web pages. Working with these more difficult cases will be covered later in Chapter 5, Dynamic Content, and Chapter 6, Interacting with Forms.
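The owner of a domain can be found by querying the WHOIS protocol. The python-whois package exposes this from Python; assuming the usual pip workflow, it can typically be installed with:

pip install python-whois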
Here is the key part of the WHOIS response when querying the appspot.com domain
with this module:
>>> import whois
>>> print whois.whois('appspot.com')
{
  ...
  "name_servers": [
    "NS1.GOOGLE.COM",
    "NS2.GOOGLE.COM",
    "NS3.GOOGLE.COM",
    "NS4.GOOGLE.COM",
    "ns4.google.com",
    "ns2.google.com",
    "ns1.google.com",
    "ns3.google.com"
  ],
  "org": "Google Inc.",
  "emails": [
    "abusecomplaints@markmonitor.com",
    "dns-admin@google.com"
  ]
}
We can see here that this domain is owned by Google, which is correct: this domain is for the Google App Engine service. We would need to be careful when crawling this domain because Google often blocks web crawlers, despite being fundamentally a web crawling business itself.
Crawling a sitemap
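The download function discussed here takes a URL and returns the downloaded HTML. A minimal initial version, assuming it simply wraps urllib2.urlopen, would be:

import urllib2

def download(url):
    # download the web page at this URL and return its HTML
    return urllib2.urlopen(url).read()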
When a URL is passed, this function will download the web page and return the
HTML. The problem with this snippet is that when downloading the web page, we
might encounter errors that are beyond our control; for example, the requested page
may no longer exist. In these cases, urllib2 will raise an exception and exit the
script. To be safer, here is a more robust version to catch these exceptions:
import urllib2

def download(url):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
    return html
Now, when a download error is encountered, the exception is caught and the
function returns None.
Retrying downloads
Often, the errors encountered when downloading are temporary; for example, the
web server is overloaded and returns a 503 Service Unavailable error. For these
errors, we can retry the download as the server problem may now be resolved.
However, we do not want to retry downloading for all errors. If the server returns
404 Not Found, then the web page does not currently exist and the same request is
unlikely to produce a different result.
The full list of possible HTTP errors is defined by the Internet Engineering Task Force, and is available for viewing at https://tools.ietf.org/html/rfc7231#section-6. In this document, we can see that the 4xx errors occur when there is something wrong with our request and the 5xx errors occur when there is something wrong with the server. So, we will ensure our download function only retries the 5xx errors. Here is the updated version to support this:
def download(url, num_retries=2):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries-1)
    return html
Now, when a download error is encountered with a 5xx code, the download is
retried by recursively calling itself. The function now also takes an additional
argument for the number of times the download can be retried, which is set to two
times by default. We limit the number of times we attempt to download a web page
because the server error may not be resolvable. To test this functionality we can try
downloading http://httpstat.us/500, which returns the 500 error code:
>>> download('http://httpstat.us/500')
Downloading: http://httpstat.us/500
Download error: Internal Server Error
Downloading: http://httpstat.us/500
Download error: Internal Server Error
Downloading: http://httpstat.us/500
Download error: Internal Server Error
As expected, the download function now tries downloading the web page, and then
on receiving the 500 error, it retries the download twice before giving up.
So, to download reliably, we will need to have control over setting the user agent.
Here is an updated version of our download function with the default user agent
set to 'wswp' (which stands for Web Scraping with Python):
def download(url, user_agent='wswp', num_retries=2):
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry 5XX HTTP errors
                return download(url, user_agent, num_retries-1)
    return html
Now we have a flexible download function that can be reused in later examples to
catch errors, retry the download when possible, and set the user agent.
Sitemap crawler
For our first simple crawler, we will use the sitemap discovered in the example
website's robots.txt to download all the web pages. To parse the sitemap, we will
use a simple regular expression to extract URLs within the <loc> tags. Note that a
more robust parsing approach called CSS selectors will be introduced in the next
chapter. Here is our first example crawler:
import re

def crawl_sitemap(url):
    # download the sitemap file
    sitemap = download(url)
    # extract the sitemap links
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    # download each link
    for link in links:
        html = download(link)
        # scrape html here
        # ...
Now, we can run the sitemap crawler to download all countries from the
example website:
>>> crawl_sitemap('http://example.webscraping.com/sitemap.xml')
Downloading: http://example.webscraping.com/sitemap.xml
Downloading: http://example.webscraping.com/view/Afghanistan-1
Downloading: http://example.webscraping.com/view/Aland-Islands-2
Downloading: http://example.webscraping.com/view/Albania-3
...
This works as expected, but as discussed earlier, Sitemap files often cannot be relied
on to provide links to every web page. In the next section, another simple crawler
will be introduced that does not depend on the Sitemap file.
ID iteration crawler
In this section, we will take advantage of a weakness in the website structure to easily access all the content. Here are the URLs of some sample countries:
http://example.webscraping.com/view/Afghanistan-1
http://example.webscraping.com/view/Australia-2
http://example.webscraping.com/view/Brazil-3
We can see that the URLs only differ at the end, with the country name (known
as a slug) and ID. It is a common practice to include a slug in the URL to help
with search engine optimization. Quite often, the web server will ignore the slug
and only use the ID to match with relevant records in the database. Let us check
whether this works with our example website by removing the slug and loading
http://example.webscraping.com/view/1:
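We can make this check with the download function defined earlier; assuming the server ignores the slug as described, the expected session would look like this:

>>> html = download('http://example.webscraping.com/view/1')
Downloading: http://example.webscraping.com/view/1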
The web page still loads! This is useful to know because now we can ignore the slug
and simply iterate database IDs to download all the countries. Here is an example
code snippet that takes advantage of this trick:
import itertools

for page in itertools.count(1):
    url = 'http://example.webscraping.com/view/-%d' % page
    html = download(url)
    if html is None:
        break
    else:
        # success - can scrape the result
        pass
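One weakness of this snippet is that some records may have been deleted, leaving gaps in the database IDs, so a single failed download would stop the iteration prematurely. The following paragraph describes an improved version that tolerates a number of consecutive download errors; here is a sketch of that idea (the max_errors counter shown is an assumed name):

import itertools

# maximum number of consecutive download errors allowed
max_errors = 5
# current number of consecutive download errors
num_errors = 0
for page in itertools.count(1):
    url = 'http://example.webscraping.com/view/-%d' % page
    html = download(url)
    if html is None:
        # received an error trying to download this web page
        num_errors += 1
        if num_errors == max_errors:
            # reached maximum number of consecutive errors, so exit
            break
    else:
        # success - can scrape the result
        num_errors = 0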
The crawler in the preceding code now needs to encounter five consecutive
download errors to stop iterating, which decreases the risk of stopping the iteration
prematurely when some records have been deleted.
Iterating the IDs is a convenient approach to crawl a website, but is similar to the
sitemap approach in that it will not always be available. For example, some websites
will check whether the slug is as expected and if not return a 404 Not Found error.
Also, other websites use large nonsequential or nonnumeric IDs, so iterating is not
practical. For example, Amazon uses ISBNs as the ID for their books, which have at
least ten digits. Using an ID iteration with Amazon would require testing billions of
IDs, which is certainly not the most efficient approach to scraping their content.
Link crawler
So far, we have implemented two simple crawlers that take advantage of the structure of our sample website to download all the countries. These techniques should be used when available, because they minimize the number of web pages that need to be downloaded. However, for other websites, we need to make our crawler act more like a typical user and follow links to reach the content of interest.
We could simply download the entire website by following all links. However, this
would download a lot of web pages that we do not need. For example, to scrape user
account details from an online forum, only account pages need to be downloaded and
not discussion threads. The link crawler developed here will use a regular expression
to decide which web pages to download. Here is an initial version of the code:
import re

def link_crawler(seed_url, link_regex):
    """Crawl from the given seed URL following links matched by link_regex
    """
    crawl_queue = [seed_url]
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        # filter for links matching our regular expression
        for link in get_links(html):
            if re.match(link_regex, link):
                crawl_queue.append(link)

def get_links(html):
    """Return a list of links from html
    """
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)
To run this code, simply call the link_crawler function with the URL of the website
you want to crawl and a regular expression of the links that you need to follow. For
the example website, we want to crawl the index with the list of countries and the
countries themselves. The index links follow this format:
http://example.webscraping.com/index/1
http://example.webscraping.com/index/2
The country web pages follow this format:
http://example.webscraping.com/view/Afghanistan-1
http://example.webscraping.com/view/Aland-Islands-2
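Both link formats can be matched with a regular expression such as /(index|view). Running this first version with that expression would, however, fail almost immediately along these lines (a sketch of the expected behavior, assuming the regular expression above):

>>> link_crawler('http://example.webscraping.com', '/(index|view)')
Downloading: http://example.webscraping.com
Downloading: /index/1
Traceback (most recent call last):
  ...
ValueError: unknown url type: /index/1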
The problem with downloading /index/1 is that it only includes the path of the
web page and leaves out the protocol and server, which is known as a relative link.
Relative links work when browsing because the web browser knows which web
page you are currently viewing. However, urllib2 is not aware of this context.
To help urllib2 locate the web page, we need to convert this link into an absolute
link, which includes all the details to locate the web page. As might be expected,
Python includes a module to do just this, called urlparse. Here is an improved
version of link_crawler that uses the urlparse module to create the absolute links:
import urlparse

def link_crawler(seed_url, link_regex):
    """Crawl from the given seed URL following links matched by link_regex
    """
    crawl_queue = [seed_url]
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        for link in get_links(html):
            if re.match(link_regex, link):
                link = urlparse.urljoin(seed_url, link)
                crawl_queue.append(link)
When this example is run, you will find that it downloads the web pages without
errors; however, it keeps downloading the same locations over and over. The reason
for this is that these locations have links to each other. For example, Australia links to
Antarctica and Antarctica links right back, and the crawler will cycle between these
forever. To prevent re-crawling the same links, we need to keep track of what has
already been crawled. Here is the updated version of link_crawler that stores the
URLs seen before, to avoid redownloading duplicates:
def link_crawler(seed_url, link_regex):
    crawl_queue = [seed_url]
    # keep track of which URLs have been seen before
    seen = set(crawl_queue)
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        for link in get_links(html):
            # check if link matches expected regex
            if re.match(link_regex, link):
                # form absolute link
                link = urlparse.urljoin(seed_url, link)
                # check if this link has already been seen
                if link not in seen:
                    seen.add(link)
                    crawl_queue.append(link)
When this script is run, it will crawl the locations and then stop as expected. We
finally have a working crawler!
Advanced features
Now, let's add some features to make our link crawler more useful for crawling
other websites.
Parsing robots.txt
Firstly, we need to interpret robots.txt to avoid downloading blocked URLs.
Python comes with the robotparser module, which makes this straightforward,
as follows:
>>> import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url('http://example.webscraping.com/robots.txt')
>>> rp.read()
>>> url = 'http://example.webscraping.com'
>>> user_agent = 'BadCrawler'
>>> rp.can_fetch(user_agent, url)
False
>>> user_agent = 'GoodCrawler'
>>> rp.can_fetch(user_agent, url)
True
The robotparser module loads a robots.txt file and then provides a can_fetch() function, which tells you whether a particular user agent is allowed to access a web page or not. Here, when the user agent is set to 'BadCrawler', the robotparser module says that this web page cannot be fetched, as defined in the robots.txt of the example website.
To integrate this into the crawler, we add this check in the crawl loop:
...
while crawl_queue:
    url = crawl_queue.pop()
    # check url passes robots.txt restrictions
    if rp.can_fetch(user_agent, url):
        ...
    else:
        print 'Blocked by robots.txt:', url
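For this check to work, rp needs to be initialized from the target website's robots.txt. A small helper along these lines could do it (the function name get_robots is an assumption; the full source linked later may differ):

def get_robots(url):
    """Initialize a robots.txt parser for this website
    """
    rp = robotparser.RobotFileParser()
    rp.set_url(urlparse.urljoin(url, '/robots.txt'))
    rp.read()
    return rp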
Supporting proxies
Sometimes it is necessary to access a website through a proxy. For example, Netflix
is blocked in most countries outside the United States. Supporting proxies with
urllib2 is not as easy as it could be (for a more user-friendly Python HTTP module,
try requests, documented at http://docs.python-requests.org/). Here is how
to support a proxy with urllib2:
proxy = ...
opener = urllib2.build_opener()
proxy_params = {urlparse.urlparse(url).scheme: proxy}
opener.add_handler(urllib2.ProxyHandler(proxy_params))
response = opener.open(request)
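The throttled crawl loop shown later in this chapter calls download(url, headers, proxy=proxy, num_retries=num_retries), so the proxy handling above needs to be folded into the download function. Here is a sketch of how that could look, assuming the headers dictionary (for example {'User-agent': user_agent}) is built by the caller; the exact signature in the final source may differ:

def download(url, headers, proxy=None, num_retries=2):
    print 'Downloading:', url
    request = urllib2.Request(url, headers=headers)
    opener = urllib2.build_opener()
    if proxy:
        # route this request through the given proxy
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))
    try:
        html = opener.open(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry 5XX HTTP errors
                return download(url, headers, proxy, num_retries-1)
    return html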
Throttling downloads
If we crawl a website too fast, we risk being blocked or overloading the server.
To minimize these risks, we can throttle our crawl by waiting for a delay between
downloads. Here is a class to implement this:
import datetime
import time
import urlparse

class Throttle:
    """Add a delay between downloads to the same domain
    """
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse.urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (datetime.datetime.now() - last_accessed).seconds
            if sleep_secs > 0:
                # domain has been accessed recently,
                # so need to sleep
                time.sleep(sleep_secs)
        # update the last accessed time
        self.domains[domain] = datetime.datetime.now()
This Throttle class keeps track of when each domain was last accessed and will
sleep if the time since the last access is shorter than the specified delay. We can add
throttling to the crawler by calling throttle before every download:
throttle = Throttle(delay)
...
throttle.wait(url)
result = download(url, headers, proxy=proxy, num_retries=num_retries)
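The maximum depth feature mentioned below tracks how many links deep each URL is from the seed URL and stops following links once max_depth is reached, which protects the crawler from spider traps of endlessly generated pages. A sketch of the idea (the variable names here are assumptions; the full source linked later may differ):

def link_crawler(seed_url, link_regex, max_depth=2):
    crawl_queue = [seed_url]
    # map each seen URL to the depth at which it was found
    seen = {seed_url: 0}
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        depth = seen[url]
        if depth != max_depth:
            for link in get_links(html):
                if re.match(link_regex, link):
                    link = urlparse.urljoin(seed_url, link)
                    if link not in seen:
                        seen[link] = depth + 1
                        crawl_queue.append(link)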
Now, with this feature, we can be confident that the crawl will always complete
eventually. To disable this feature, max_depth can be set to a negative number so
that the current depth is never equal to it.
Final version
The full source code for this advanced link crawler can be downloaded at
https://bitbucket.org/wswp/code/src/tip/chapter01/link_crawler3.py.
To test this, let us try setting the user agent to BadCrawler, which we saw earlier
in this chapter was blocked by robots.txt. As expected, the crawl is blocked and
finishes immediately:
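Assuming seed_url points to the example website and link_regex is the /(index|view) pattern used earlier, the session would look something like this:

>>> link_crawler(seed_url, link_regex, user_agent='BadCrawler')
Blocked by robots.txt: http://example.webscraping.com/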
Now, let's try using the default user agent and setting the maximum depth to 1 so
that only the links from the home page are downloaded:
>>> link_crawler(seed_url, link_regex, max_depth=1)
Downloading: http://example.webscraping.com//index
Downloading: http://example.webscraping.com/index/1
Downloading: http://example.webscraping.com/view/Antigua-and-Barbuda-10
Downloading: http://example.webscraping.com/view/Antarctica-9
Downloading: http://example.webscraping.com/view/Anguilla-8
Downloading: http://example.webscraping.com/view/Angola-7
Downloading: http://example.webscraping.com/view/Andorra-6
Downloading: http://example.webscraping.com/view/American-Samoa-5
Downloading: http://example.webscraping.com/view/Algeria-4
Downloading: http://example.webscraping.com/view/Albania-3
Downloading: http://example.webscraping.com/view/Aland-Islands-2
Downloading: http://example.webscraping.com/view/Afghanistan-1
As expected, the crawl stopped after downloading the first page of countries.
Summary
This chapter introduced web scraping and developed a sophisticated crawler that
will be reused in the following chapters. We covered the usage of external tools and
modules to get an understanding of a website, user agents, sitemaps, crawl delays,
and various crawling strategies.
In the next chapter, we will explore how to scrape data from the crawled web pages.