4. Python Modules for Web Scraping
In this chapter, let us learn about various Python modules that we can use for web scraping.
First, we need to create a directory to represent the project and then enter that directory. In the prompt shown
below, this directory is named webscrap.
Now, activate the virtual environment (websc in the prompt below) with the command given below. Once it is
successfully activated, you will see its name in brackets on the left-hand side of the prompt.
(base) D:\ProgramData\webscrap>websc\scripts\activate
To deactivate the virtual environment, we can use the deactivate command.
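The same project layout can also be created from Python itself. The following is a minimal sketch that uses the standard library's venv module, which is a swap for whichever tool produced the prompt above. The names webscrap and websc are taken from that prompt, and the temporary parent directory is an assumption made only so that the sketch does not clutter the current folder.

```python
import tempfile
import venv
from pathlib import Path

# Create the project directory inside a throwaway parent directory (an assumption)
project_dir = Path(tempfile.mkdtemp()) / "webscrap"
project_dir.mkdir()

# Create the virtual environment inside the project directory.
# with_pip=False keeps creation fast and offline; use with_pip=True to bootstrap pip.
env_dir = project_dir / "websc"
venv.create(env_dir, with_pip=False)

# Every environment created by venv contains a pyvenv.cfg file
config_file = env_dir / "pyvenv.cfg"
print(config_file.exists())
```

The activation script is then found in the environment's Scripts folder on Windows, or in its bin folder on Linux and macOS.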
In this section, we are going to discuss useful Python libraries for web scraping.
Requests
Requests is a simple and efficient Python HTTP library used for accessing web pages. With the help of Requests,
we can get the raw HTML of web pages, which can then be parsed to retrieve the data. Before using Requests, let
us look at its installation.
Installing Requests
We can install it either in our virtual environment or in the global installation. With the help of the pip
command, we can easily install it as follows −
pip install requests
Example
In this example, we make an HTTP GET request for a web page. For this, we first need to import the requests
library as follows −
In [1]: import requests
In the following line of code, we use requests to make a GET request for the URL
https://authoraditiagarwal.com/ −
In [2]: r = requests.get('https://authoraditiagarwal.com/')
In [5]: r.text[:200]
Observe that in the following output, we got the first 200 characters.
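Beyond a bare GET call, it often helps to inspect the request before sending it and to check the response for errors. The following is a minimal sketch, not the only way to do this. The function names build_get_request and fetch_html, the example.com URL, the page parameter and the 10-second timeout are all assumptions made for illustration.

```python
import requests


def build_get_request(url, params=None):
    """Prepare a GET request without sending it, so the final URL can be inspected."""
    return requests.Request("GET", url, params=params).prepare()


def fetch_html(url, timeout=10):
    """Send a GET request and return the page's HTML, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text


# Nothing is sent over the network here; the request is only prepared
prepared = build_get_request("https://example.com/articles", params={"page": 2})
print(prepared.method, prepared.url)
```

Calling fetch_html with a real URL performs the actual request, and slicing its return value, as in r.text[:200], gives the first characters of the page.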
Urllib3
It is another Python library that can be used for retrieving data from URLs, similar to the requests library. You
can read more about it in its technical documentation at https://urllib3.readthedocs.io/en/latest/.
Installing Urllib3
Using the pip command, we can install urllib3 either in our virtual environment or in the global installation.
(base) D:\ProgramData>pip install urllib3
Collecting urllib3
  Using cached https://files.pythonhosted.org/packages/bd/c9/6fdd990019071a4a32a5e7cb78a1d92c53851ef4f56f62a3486e6a7d8ffb/urllib3-1.23-py2.py3-none-any.whl
Installing collected packages: urllib3
Successfully installed urllib3-1.23
In the following example, we scrape a web page by using Urllib3 and BeautifulSoup. We use Urllib3 in place of
the requests library to get the raw HTML data from the web page. Then we use BeautifulSoup to parse that HTML
data.
import urllib3
from bs4 import BeautifulSoup

# A PoolManager handles connection pooling for all requests
http = urllib3.PoolManager()
r = http.request('GET', 'https://authoraditiagarwal.com')
# r.data holds the raw response body as bytes
soup = BeautifulSoup(r.data, 'lxml')
print(soup.title)       # the <title> tag itself
print(soup.title.text)  # only the text inside the tag
This is the output you will observe when you run this code −
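The parsing step can also be tried without any network access. The following is a minimal sketch that feeds BeautifulSoup an inline HTML string in place of r.data. The sample markup and its links are made up for illustration, and the built-in html.parser is used instead of lxml so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# A made-up page standing in for the bytes returned by urllib3
SAMPLE_HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <a href="/first">First</a>
    <a href="/second">Second</a>
  </body>
</html>
"""

# "html.parser" ships with Python, so it needs no separate installation
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

page_title = soup.title.text                       # text inside the <title> tag
links = [a["href"] for a in soup.find_all("a")]    # href of every <a> tag

print(page_title)
print(links)
```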
Selenium
It is an open-source automated testing suite for web applications that works across different browsers and
platforms. It is not a single tool but a suite of software. Selenium has bindings for Python, Java, C#, Ruby and
JavaScript. Here we are going to perform web scraping by using Selenium and its Python bindings. You can learn
more about Selenium with Java at the link Selenium.
The Selenium Python bindings provide a convenient API to access Selenium WebDrivers such as Firefox, IE, Chrome,
Remote etc. At the time of writing, the supported Python versions were 2.7 and 3.5 and above.
Installing Selenium
Using the pip command, we can install Selenium either in our virtual environment or in the global installation −
pip install selenium
As Selenium requires a driver to interface with the chosen browser, we need to download one. The following table
lists different browsers and the links for downloading their drivers.
Browser    Driver download link
Chrome     https://sites.google.com/a/chromium.org/
Edge       https://developer.microsoft.com/
Firefox    https://github.com/
Safari     https://webkit.org/
Example
This example shows web scraping using Selenium. Selenium can also be used for testing, which is called Selenium
testing. After downloading the driver that matches our browser version, we can write the Python script.
Now, provide the path of the web driver which we have downloaded as per our requirement −
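from selenium import webdriver
path = r'C:\Users\gaurav\Desktop\Chromedriver'
# executable_path is the argument used by older Selenium releases; Selenium 4 takes a Service object
browser = webdriver.Chrome(executable_path=path)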
Now, provide the URL that we want to open in the web browser, which is now controlled by our Python script.
browser.get('https://authoraditiagarwal.com/leadershipmanagement')
We can also locate a particular element by providing its XPath, in the same way as with lxml.
browser.find_element_by_xpath('/html/body').click()
You can check the browser, controlled by Python script, for output.
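Calls such as find_element_by_xpath and the executable_path argument belong to older Selenium releases. As a minimal sketch, assuming Selenium 4 is installed, the same flow can be written with a Service object and the find_element method. The function names make_chrome_driver and read_body_text are hypothetical, and the driver path is whatever location the driver was downloaded to.

```python
def make_chrome_driver(driver_path):
    """Start Chrome via Selenium 4's Service object (needs Selenium and a Chrome driver)."""
    # Imported inside the function so that read_body_text below can be
    # tried out with any driver-like object, even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    return webdriver.Chrome(service=Service(executable_path=driver_path))


def read_body_text(driver, url):
    """Open url in the controlled browser and return the visible text of <body>."""
    driver.get(url)
    # "xpath" is the value of selenium.webdriver.common.by.By.XPATH
    body = driver.find_element("xpath", "/html/body")
    return body.text
```

With a real browser, we would call make_chrome_driver with the driver path, pass the returned driver to read_body_text together with the URL, and finally call the driver's quit method to close the browser.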
Scrapy
Scrapy is a fast, open-source web crawling framework written in Python, used to extract data from web pages
with the help of selectors based on XPath. Scrapy was first released on June 26, 2008 under the BSD license,
with the milestone 1.0 release following in June 2015. It provides all the tools we need to extract, process and
structure the data from websites.
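To give a feel for what selecting data with XPath looks like, the following is a minimal sketch. It does not use Scrapy's own selectors; it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath and needs well-formed markup. The sample page and its class names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed page standing in for a downloaded web page
PAGE = """<html>
  <head><title>Sample page</title></head>
  <body>
    <div class="quote"><span class="text">First quote</span><span class="author">Author A</span></div>
    <div class="quote"><span class="text">Second quote</span><span class="author">Author B</span></div>
  </body>
</html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# A path relative to the root selects the page title
title = root.findtext("head/title")

# ".//div[@class='quote']" matches every <div class="quote"> at any depth
quotes = []
for div in root.findall(".//div[@class='quote']"):
    quotes.append({
        "text": div.findtext("span[@class='text']"),
        "author": div.findtext("span[@class='author']"),
    })

print(title)
print(quotes)
```

Extracting repeated blocks into a list of dictionaries in this way mirrors the kind of structured output a crawler is usually written to produce.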
Installing Scrapy
Using the pip command, we can install Scrapy either in our virtual environment or in the global installation −
pip install scrapy
For a more detailed study of Scrapy, you can go to the link Scrapy.