Web Scraping Handbook
Kevin Sahin
This book is for sale at http://leanpub.com/webscrapinghandbook
This is a Leanpub book. Leanpub empowers authors and publishers with the
Lean Publishing process. Lean Publishing is the act of publishing an
in-progress ebook using lightweight tools and many iterations to get reader
feedback, pivot until you have the right book and build traction once you do.
Selenium API
Infinite scroll
Captcha solving, PDF parsing, and OCR
Captcha solving
PDF parsing
Optical Character Recognition
Stay under cover
Headers
Proxies
TOR: The Onion Router
Tips
Cloud scraping
Serverless
Deploying an Azure function
Conclusion
Introduction to Web scraping
Web scraping or crawling is the act of fetching data from a third party website
by downloading and parsing the HTML code to extract the data you want. It
can be done manually, but generally this term refers to the automated process
of downloading the HTML content of a page, parsing/extracting the data, and
saving it into a database for further analysis or use.
curl https://api.coinmarketcap.com/v1/ticker/ethereum/?convert=EUR
¹https://en.wikipedia.org/wiki/Application_programming_interface
{
  "id": "ethereum",
  "name": "Ethereum",
  "symbol": "ETH",
  "rank": "2",
  "price_usd": "414.447",
  "price_btc": "0.0507206",
  "24h_volume_usd": "1679960000.0",
  "market_cap_usd": "39748509988.0",
  "available_supply": "95907342.0",
  "total_supply": "95907342.0",
  "max_supply": null,
  "percent_change_1h": "0.64",
  "percent_change_24h": "13.38",
  "percent_change_7d": "25.56",
  "last_updated": "1511456952",
  "price_eur": "349.847560557",
  "24h_volume_eur": "1418106314.76",
  "market_cap_eur": "33552949485.0"
}
We could also imagine that an E-commerce website has an API that lists every
product, through endpoints like these:
curl https://api.e-commerce.com/products
curl https://api.e-commerce.com/products/123
Since not every website offers a clean API, or an API at all, web scraping can
be the only solution when it comes to extracting website information.
APIs are generally easier to use; the problem is that lots of websites
don't offer any API. Building an API can be a huge cost for
companies: you have to ship it, test it, handle versioning, create the
documentation, and there are infrastructure costs, engineering costs, etc.
The second issue with APIs is that they sometimes enforce rate limits
(you are only allowed to call a certain endpoint X times per day/hour),
and the third issue is that the data can be incomplete.
The good news is that almost everything you can see in your browser can
be scraped.
As you can see, there are many use cases to web scraping.
Mint.com screenshot
With this process, Mint is able to support any bank, regardless of the existence
of an API, and no matter what backend/frontend technology the bank uses.
That’s a good example of how useful and powerful web scraping is. The
drawback of course, is that each time a bank changes its website (even a simple
change in the HTML), the robots will have to be changed as well.
Parsely Dashboard
Parse.ly is a startup providing analytics for publishers. Its platform crawls the
entire publisher website to extract all posts (text, meta-data…) and performs
Natural Language Processing to categorize the key topics/metrics. It allows
publishers to understand what underlying topics the audience likes or dislikes.
In this book, you will learn how to collect data with web scraping, how to
inspect websites with Chrome dev tools, and how to parse HTML and store the
data. You will learn how to handle Javascript-heavy websites, find hidden
APIs, break captchas, and avoid the classic traps and anti-scraping techniques.
Learning web scraping can be challenging, which is why I aim to explain just
enough theory to understand the concepts, and immediately apply that theory
with practical and down-to-earth examples. We will focus on Java, but all
the techniques we will see can be implemented in many other languages, like
Python, Javascript, or Go.
Web fundamentals
The internet is really complex: there are many underlying technologies and
concepts involved in viewing a simple web page in your browser. I don't
pretend to explain everything, but I will show you the most important
things you have to understand to extract data from the web.
Http request
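The request being described looks like the following. Only the path comes from the text below; the Host and the other headers here are illustrative assumptions:

```
GET /how-to-log-in-to-almost-any-websites/ HTTP/1.1
Host: ksah.in
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13) AppleWebKit/537.36
```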
In the first line of this request, you can see the GET verb or method being used,
meaning we request data from the path /how-to-log-in-to-almost-any-websites/.
There are other HTTP verbs; you can see the full list here². Then you can see
the version of the HTTP protocol; in this book we will focus on HTTP/1.1. Note
that as of Q4 2017, only 20% of the top 10 million websites support HTTP/2.
And finally, there is a key-value list called headers. Here are the most important
header fields:
And the list goes on… you can find the full header list here³.
The server responds with a message like this :
Http response
HTTP/1.1 200 OK
Server: nginx/1.4.6 (Ubuntu)
Content-Type: text/html; charset=utf-8
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
...[HTML CODE]
On the first line, we have a new piece of information, the HTTP code 200 OK.
It means the request has succeeded. As with the request headers, there are lots
of HTTP codes, split into four common classes:
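The class of a status code is simply its first digit, so a scraper can branch on it with integer division. A minimal illustration (not code from the book):

```java
public class StatusClass {
    public static void main(String[] args) {
        int[] codes = {200, 301, 404, 500};
        for (int code : codes) {
            // 2xx = success, 3xx = redirection, 4xx = client error, 5xx = server error
            System.out.println(code + " -> " + (code / 100) + "xx");
        }
    }
}
```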
³https://en.wikipedia.org/wiki/List_of_HTTP_header_fields
Then, if you are sending this HTTP request with your web browser, the
browser will parse the HTML code, fetch all the associated assets (Javascript
files, CSS files, images…) and render the result into the main window.
HTML page
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>What is the DOM ?</title>
</head>
<body>
<h1>DOM 101</h1>
<p>Web scraping is awesome!</p>
<p>Here is my <a href="https://ksah.in">blog</a></p>
</body>
</html>
This HTML code is basically HTML content encapsulated inside other HTML
content. The HTML hierarchy can be viewed as a tree. We can already see
this hierarchy through the indentation in the HTML code. When your web
browser parses this code, it will create a tree which is an object representation
of the HTML document. It is called the Document Object Model. Below is the
internal tree structure inside the Google Chrome inspector:
Chrome Inspector
On the left we can see the HTML tree, and on the right we have the Javascript
object representing the currently selected element (in this case, the <p> tag),
with all its attributes. And here is the tree structure for this HTML code :
The important thing to remember is that the DOM you see in your
browser, when you right-click + inspect, can be really different from
the actual HTML that was sent. Maybe some Javascript code was
executed and dynamically changed the DOM! For example, when
you scroll through your Twitter feed, a request is sent by your browser
to fetch new tweets, and some Javascript code dynamically adds
those new tweets to the DOM.
Dom Diagram
The root node of this tree is the <html> tag. It contains two children:
<head> and <body>. There are lots of node types in the DOM specification⁴, but
here are the most important ones:
You can see the full list here⁵. Now let’s write some Javascript code to
understand all of this :
First let's see how many child nodes our <head> element has, and show the list.
To do so, we will write some Javascript code inside the Chrome console. The
document object in Javascript is the owner of all other objects in the web page
(including every DOM node).
We want to make sure that we have two child nodes for our head element. It's
simple:
How many childnodes ?
document.head.childNodes.length
⁵https://developer.mozilla.org/en-US/docs/Web/API/Node
Javascript example
What an unexpected result! It shows five nodes instead of the expected two.
We can see with the for loop that three text nodes were added. If you click on
these text nodes in the console, you will see that their text content is either
a line break or a tabulation (\n or \t). In most modern browsers, a text node is
created for each whitespace sequence outside of HTML tags.
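The same behaviour can be reproduced outside the browser. Here is a sketch using the JDK's built-in XML DOM parser on a well-formed version of the <head> element above (HtmlUnit, used later in the book, exposes a similar org.w3c.dom API):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class WhitespaceNodes {
    public static void main(String[] args) throws Exception {
        String head = "<head>\n  <meta charset=\"utf-8\"/>\n"
                + "  <title>What is the DOM ?</title>\n</head>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(head.getBytes(StandardCharsets.UTF_8)));
        // The whitespace between tags produces extra text nodes: 5 children, not 2
        System.out.println(doc.getDocumentElement().getChildNodes().getLength());
    }
}
```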
In the next chapters, we will not use the Javascript API directly to manipulate
the DOM, but a similar API, directly in Java. I think it is important to know
how things work in Javascript before doing it with other languages.
• Headless browser
• Do things more “manually” : Use an HTTP library to perform the GET
request, then use a library like Jsoup⁷ to parse the HTML and extract the
data you want
Each option has its pros and cons. A headless browser is like a normal web
browser, without the Graphical User Interface. It is often used for QA reasons,
to perform automated testing on websites. There are lots of different headless
browsers, like Headless Chrome⁸, PhantomJS⁹, HtmlUnit¹⁰, we will see this
⁶https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model/Whitespace_in_the_DOM
⁷https://jsoup.org/
⁸https://chromium.googlesource.com/chromium/src/+/lkgr/headless/README.md
⁹http://phantomjs.org/
¹⁰http://htmlunit.sourceforge.net/
later. The good thing about a headless browser is that it can take care of lots of
things: parsing the HTML, dealing with authentication cookies, filling in forms,
executing Javascript functions, accessing iFrames… The drawback is that there is of
course some overhead compared to using a plain HTTP library and a parsing
library.
In the next three sections we will see how to select and extract data inside
HTML pages, with Xpath, CSS selectors and regular expressions.
Xpath
Xpath is a technology that uses path expressions to select nodes or node-
sets in an XML document (or HTML document). Like the Document
Object Model, Xpath has been a W3C standard since 1999. Even though Xpath is not a
programming language in itself, it allows you to write expressions that
directly access a specific node, or a specific node set, without having to
traverse the entire HTML tree (or XML tree).
Entire books have been written on Xpath, and, as I said before, I don't
pretend to explain everything in depth; this is an introduction to Xpath, and
we will see through real examples how you can use it for your web scraping
needs.
We will use the following HTML document for the examples below:
HTML example
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>Xpath 101</title>
</head>
<body>
<div class="product">
<header>
<hgroup>
<h1>Amazing product #1</h1>
<h3>The best product ever made</h3>
</hgroup>
</header>
<figure>
<img src="http://lorempixel.com/400/200">
</figure>
<section>
<p>Text text text</p>
<details>
<summary>Product Features</summary>
<ul>
<li>Feature 1</li>
<li class="best-feature">Feature 2</li>
<li id="best-id">Feature 3</li>
</ul>
</details>
<button>Buy Now</button>
</section>
</div>
</body>
</html>
Xpath Syntax
There are different types of expressions to select a node in an HTML document.
Here are the most important ones:

Xpath Expression   Description
nodename           Selects all nodes with this nodename
/                  Selects from the root node (useful for writing absolute paths)
//                 Selects matching nodes anywhere below the current node
.                  Selects the current node
..                 Selects the current node's parent
You can also use predicates to find a node that contains a specific value.
Predicates are always in square brackets: [predicate]. Here are some examples:

Xpath Expression           Description
//li[last()]               Selects the last li element
//li[3]                    Selects the third li element (the index starts at 1)
//div[@class='product']    Selects all div elements whose class attribute has the value product
Now we will see some examples of Xpath expressions. We can test XPath
expressions inside Chrome Dev tools, so it is time to fire up Chrome. To do so,
right-click on the web page -> Inspect, and then cmd + f on a Mac or ctrl + f
on other systems; you can then enter an Xpath expression, and the matches will
be highlighted in the Dev tools.
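You can evaluate the same expressions outside the browser with the JDK's javax.xml.xpath package. Here is a sketch on a well-formed fragment of the product page above:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XpathExamples {
    public static void main(String[] args) throws Exception {
        String xml = "<ul>"
                + "<li>Feature 1</li>"
                + "<li class=\"best-feature\">Feature 2</li>"
                + "<li id=\"best-id\">Feature 3</li>"
                + "</ul>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Predicate on an attribute value
        System.out.println(xpath.evaluate("//li[@class='best-feature']", doc));
        // last() selects the final li element
        System.out.println(xpath.evaluate("//li[last()]", doc));
    }
}
```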
In the dev tools, you can right-click on any DOM node and copy its
full Xpath expression, which you can later simplify. There is a lot more
that we could discuss about Xpath, but it is outside this book's scope;
I suggest you read this great W3Schools tutorial¹¹ if you want to
learn more.
In the next chapter we will see how to use Xpath expressions inside our Java
scraper to select HTML nodes containing the data we want to extract.
Regular Expression
A regular expression (RE, or Regex) is a search pattern for strings. With regex,
you can search for a particular character/word inside a bigger body of text.
For example, you could identify all phone numbers inside a web page. You can
also replace items; for example, you could replace all uppercase tags in a poorly
formatted HTML document with lowercase ones. You can also validate some inputs…
The pattern used by the regex is applied from left to right. Each source
character is only used once. For example, the regex oco will match the string
ococo only once, because there is only one distinct sub-string that matches.
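You can check this behaviour with java.util.regex (a small illustration, not code from the book):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LeftToRight {
    public static void main(String[] args) {
        Matcher m = Pattern.compile("oco").matcher("ococo");
        int count = 0;
        // Each source character is consumed once, so the overlapping
        // second "oco" (starting at index 2) is never tried
        while (m.find()) {
            count++;
        }
        System.out.println(count);
    }
}
```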
¹¹https://www.w3schools.com/xml/xpath_intro.asp
Regex          Description
?              Matches zero or one of the preceding item
{x}            Matches exactly x of the preceding item
\d             Matches any digit
\D             Matches any non-digit
\s             Matches any whitespace character
\S             Matches any non-whitespace character
(expression)   Captures the group matched inside the parentheses
<p>Price : 19.99$</p>
We could select this text node with an Xpath expression, and then use this
kind of regex to extract the price:
^Price\s:\s(\d+\.\d{2})\$
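In Java, the same regex can be applied to the text content of the node (backslashes are doubled inside a Java string literal):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PriceRegex {
    public static void main(String[] args) {
        // Text content of the <p> node selected with Xpath
        String text = "Price : 19.99$";
        Pattern p = Pattern.compile("^Price\\s:\\s(\\d+\\.\\d{2})\\$");
        Matcher m = p.matcher(text);
        if (m.find()) {
            System.out.println(m.group(1)); // the captured price
        }
    }
}
```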
¹²https://en.wikipedia.org/wiki/Semantic_Web
(?:[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*|"(\
?:[\x01-
\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e\x\
7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[\
a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]\
|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\\
x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
¹³https://tools.ietf.org/html/rfc2822#section-3.4.1
¹⁴https://www.princeton.edu/~mlovett/reference/Regular-Expressions.pdf
¹⁵https://regex101.com/
Extracting the data you want
For our first example, we are going to fetch items from Hacker News.
Although they offer a nice API, let's pretend they don't.
Tools
You will need Java 8 with HtmlUnit¹⁶. HtmlUnit is a Java headless browser; it
is this library that will allow you to perform HTTP requests on websites and
parse the HTML content.
pom.xml
<dependency>
<groupId>net.sourceforge.htmlunit</groupId>
<artifactId>htmlunit</artifactId>
<version>2.28</version>
</dependency>
If you are using Eclipse, I suggest you configure the max length in the detail
pane (when you click in the Variables tab) so that you will see the entire HTML
of your current page.
¹⁶http://htmlunit.sourceforge.net
Now you can open your favorite IDE; it is time to code. HtmlUnit needs a
WebClient to make a request. There are many options (proxy settings, browser,
redirect enabled…). We are going to disable Javascript since it's not required
for our example, and disabling Javascript generally makes pages load faster
(in this specific case, it does not matter). Then we perform a GET request to
the Hacker News URL, and print the HTML content we received from the
server.
Simple GET request
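A minimal sketch of what this request looks like with HtmlUnit (the option names below are from the HtmlUnit 2.x API declared in the pom above):

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class SimpleGet {
    public static void main(String[] args) throws Exception {
        try (WebClient client = new WebClient()) {
            // Javascript is not needed for this example
            client.getOptions().setJavaScriptEnabled(false);
            client.getOptions().setCssEnabled(false);
            HtmlPage page = client.getPage("https://news.ycombinator.com");
            System.out.println(page.asXml());
        }
    }
}
```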
The HtmlPage object will contain the HTML code, you can access it with the
asXml() method.
Now for each item, we are going to extract the title, URL, author, etc. First let's
take a look at what happens when you inspect a Hacker News post (right-click
on the element + Inspect in Chrome).
• getHtmlElementById(String id)
• getFirstByXPath(String Xpath)
• getByXPath(String XPath) which returns a List
• Many more can be found in the HtmlUnit Documentation
Since there isn't any ID we could use, we have to use an Xpath expression to
select the tags we want. We can see that for each item, we have two lines of
text. The first line contains the position, the title, the URL and the ID; the
second contains the score, the author and the comments. In the DOM structure, each
text line is inside a <tr> tag, so the first thing we need to do is get the full <tr
class="athing"> list. Then we will iterate through this list, and for each item
select the title, the URL, the author, etc. with a relative Xpath, and then print
the text content or value.
HackerNewsScraper.java
Selecting nodes with Xpath
System.out.println(jsonString);
}
}
Printing the result in your IDE is cool, but exporting it to JSON or another
well-formatted/reusable format is better. We will use JSON, with the Jackson¹⁷
library, to map items to JSON.
First we need a POJO (plain old java object) to represent the Hacker News
items :
HackerNewsItem.java
POJO
public class HackerNewsItem {
private String title;
this.author = author;
this.score = score;
this.position = position;
this.id = id;
}
//getters and setters
}
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
<version>2.7.0</version>
</dependency>
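With the dependency in place, serialization is one call on an ObjectMapper. Here is a sketch using a minimal stand-in class, since the full HackerNewsItem POJO is only partially shown above:

```java
import com.fasterxml.jackson.databind.ObjectMapper;

public class ToJson {
    // Minimal stand-in for the HackerNewsItem POJO above
    public static class Item {
        public String title = "Example title";
        public String author = "someone";
        public int score = 42;
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        String jsonString = mapper.writeValueAsString(new Item());
        System.out.println(jsonString);
    }
}
```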
And that’s it. You should have a nice list of JSON formatted items.
Go further
This example is not perfect, there are many things that can be done :
In this chapter, we are going to see how to handle forms on the web. Knowing
how to submit forms can be critical to extract information behind a login form,
or to perform actions that require authentication. Here are some examples
of actions that require submitting a form:
• Create an account
• Authentication
• Post a comment on a blog
• Upload an image or a file
Handling forms
Form Theory
Form diagram
There are two parts to a functional HTML form: the user interface (defined by
its HTML code and CSS) with its different inputs, and the backend code, which is
going to process the different values the user entered, for example by storing
them in a database, or charging the credit card in the case of a payment form.
Form tag
Form diagram 2
HTML forms begin with a <form> tag. There are many attributes¹⁹. The most
important ones are the action and method attributes.
The action attribute represents the URL where the HTTP request will be sent,
and the method attribute specifies which HTTP method to use.
Generally, POST methods are used when you create or modify something, for
example:
• Login forms
• Account creation
• Add a comment to a blog
¹⁹https://developer.mozilla.org/en-US/docs/Web/HTML/Element/form
Form inputs
In order to collect user input, the <input> element is used. It is this element
that makes the text field appear. The <input> element has different attributes:
And here is the corresponding HTML code (CSS code is not included):
<div class="container">
<label for="uname"><b>Username</b></label>
<input type="text" placeholder="Enter Username" name="uname" requir\
ed>
<label for="psw"><b>Password</b></label>
<input type="password" placeholder="Enter Password" name="psw" requ\
ired>
<button type="submit">Login</button>
</div>
</form>
When a user fills in the form with his credentials, let's say username and
my_great_password, and clicks the submit button, the request sent by the
browser will look like this:
Http request
POST /login HTTP/1.1
Host: example.com
Content-Type: application/x-www-form-urlencoded
uname=username&psw=my_great_password
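The body of such a request is just URL-encoded key=value pairs joined with &. Here is how it can be built with the JDK alone (field names taken from the form above):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class FormBody {
    public static void main(String[] args) throws UnsupportedEncodingException {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("uname", "username");
        fields.put("psw", "my_great_password");
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            // application/x-www-form-urlencoded escaping
            body.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                .append('=')
                .append(URLEncoder.encode(e.getValue(), "UTF-8"));
        }
        System.out.println(body);
    }
}
```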
Cookies
After the POST request is made, if the credentials are valid the server will
generally set cookies in the response headers, to allow the user to navigate.
• session_id
• session
• JSESSIONID
• PHPSESSID
This cookie will be sent with each subsequent request by the browser, and the
website's backend will check its presence and validity to authorize requests.
Cookies are not only used for login, but for lots of different use cases:
• Shopping carts
• User preferences
• Tracking user behavior
Cookies are small key/value pairs stored in the browser, or in an HTTP client,
that look like this:
cookie_name=cookie_value
HTTP/1.0 200 OK
Content-type: text/html
Set-Cookie: cookie_name=cookie_value
Http request
GET /index.html HTTP/1.1
Host: example.com
Cookie: cookie_name=cookie_value
A cookie can also carry several attributes:
• Expires: Expiration date, by default, cookies expire when the client closes
the connection.
• Secure: only sent to HTTPS URLs
• HttpOnly: Inaccessible to Javascript Document.cookie, to prevent session
hijacking and XSS attack²¹
• Domain: Specifies which host is allowed to receive the cookie
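The JDK can parse such a Set-Cookie header for you with java.net.HttpCookie:

```java
import java.net.HttpCookie;
import java.util.List;

public class ParseCookie {
    public static void main(String[] args) {
        // HttpCookie.parse accepts the full header, "Set-Cookie:" prefix included
        List<HttpCookie> cookies = HttpCookie.parse("Set-Cookie: cookie_name=cookie_value");
        HttpCookie cookie = cookies.get(0);
        System.out.println(cookie.getName() + "=" + cookie.getValue());
    }
}
```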
Login forms
To study login forms, let me introduce you to the website I made to illustrate
the examples in this book: https://www.javawebscrapingsandbox.com²²
This website will serve for the rest of the book for lots of different examples,
starting with this authentication example. Let's take a look at the login form
HTML:
²¹https://developer.mozilla.org/en-US/docs/Glossary/Cross-site_scripting
²²https://www.javawebscrapingsandbox.com
There are two "difficult" things here: the XPath expressions to select the
different inputs, and how to submit the form.
To select the email input, it is quite simple: we have to select the first input
inside the form whose name attribute is equal to email, so this XPath
expression should work: //form//input[@name='email'].
Once you have the form object, you can generate the POST request for this
form using loginForm.getWebRequest(null). That's all you have to do :)
Let’s take a look at the full code:
Login example
}
}
This method works for almost every website. Sometimes, if the website uses a
Javascript framework, HtmlUnit will not be able to execute the Javascript code
(even with setJavaScriptEnabled(true)) and you will have to either inspect
the HTTP POST request in Chrome Dev Tools and recreate it, or use Headless
Chrome, which I will cover in the next chapter.
Let's take a look at the POST request created by HtmlUnit when we call
loginForm.getWebRequest(null). To view this, launch the main method in
debug mode, and inspect the content (ctrl/cmd + Shift + D in Eclipse):
WebRequest[<url="https://www.javawebscrapingsandbox.com/account/login",
POST, EncodingType[name=application/x-www-form-urlencoded],
[csrf_token=1524752332##6997dd9d5ed448484131add18b41a4263541b5c2,
email=test@test.com,
password=test],
{Origin=https://www.javawebscrapingsandbox.com/account/login,
Accept=text/html,application/xhtml+xml,application/xml;q=0.9,image/web\
p,image/apng,*/*;q=0.8,
Cache-Control=max-age=0,
Referer=https://www.javawebscrapingsandbox.com/account/login,
Accept-Encoding=gzip, deflate}, null>]
We have a lot going on here. You can see that instead of just having two
parameters sent to the server (email and password), we also have a
csrf_token parameter, and its value changes every time we submit the form. This
parameter is hidden, as you can see in the form's HTML:
CSRF token
CSRF stands for Cross-Site Request Forgery. The token is generated by the
server and is required in every form submission / POST request. Almost
every website uses this mechanism to prevent CSRF attacks. You can learn
more about CSRF attacks here²³. Now let's create our own POST request with
HtmlUnit.
The first thing we need is to create a WebRequest object. Then we need to
set the URL, the HTTP method, the headers, and the parameters. Adding a request
header to a WebRequest object is quite simple: all you need to do is
call the setAdditionalHeader method. Adding parameters to your request
must be done with the setRequestParameters method, which takes a list
of NameValuePair. As discussed earlier, we have to add the csrf_token to
the parameters; it can be selected easily with this XPath expression:
//form//input[@name='csrf_token']
WebRequest request = new WebRequest(new URL(loginUrl), HttpMethod.POST);
List<NameValuePair> params = new ArrayList<>();
// csrfToken holds the value of //form//input[@name='csrf_token']
params.add(new NameValuePair("csrf_token", csrfToken));
params.add(new NameValuePair("email", email));
params.add(new NameValuePair("password", password));
request.setRequestParameters(params);
request.setAdditionalHeader("Content-Type", "application/x-www-form-urlencoded");
request.setAdditionalHeader("Accept-Encoding", "gzip, deflate");
page = client.getPage(request);
²³https://en.wikipedia.org/wiki/Cross-site_request_forgery
Login algorithm
inputLogin.setValueAttribute(login);
inputPassword.setValueAttribute(password);
try {
System.out.println("Starting autoLogin on " + loginUrl);
WebClient client = autoLogin(loginUrl, login, password);
HtmlPage page = client.getPage(baseUrl) ;
if(logoutLink != null ){
System.out.println("Successfully logged in !");
// printing the cookies
for(Cookie cookie : client.
getCookieManager().getCookies()){
System.out.println(cookie.toString());
}
}else{
System.err.println("Wrong credentials");
}
} catch (Exception e) {
e.printStackTrace();
}
}
Go further
There are many cases where this method will not work: Amazon, DropBox…
and all other two-step or captcha-protected login forms.
Things that can be improved in this code:
File Upload
File upload is not something often used in web scraping, but it can be
interesting to know how to upload files, for example if you want to test your
own website or automate some tasks on websites.
There is nothing complicated, here is a little form on the sandbox website²⁵
(you need to be authenticated):
Form example
</form>
</div>
As usual, the goal here is to select the form. If there were a name attribute we
could use the method getFormByName(), but in this case there isn't, so we will use a
good old XPath expression. Then we have to select the file input and
set our file name on this input. Note that you have to be authenticated to post
this form.
File upload example
fileName = "file.png" ;
page = client.getPage(baseUrl + "upload_file") ;
HtmlForm uploadFileForm = page.getFirstByXPath("//form[@action='/upload\
_file']");
HtmlFileInput fileInput = uploadFileForm.getInputByName("user_file");
fileInput.setValueAttribute(fileName);
fileInput.setContentType("image/png");
Other forms
Search Forms
Another common need when doing web scraping is submitting search forms.
Websites with a large database, like marketplaces, often provide a search
form to look for a specific set of items.
There are generally three different ways search forms are implemented:
• When you submit the form, a POST request is sent to the server
• A GET request is sent with query parameters
• An AJAX call is made to the server
Search Form
page = client.getPage(form.getWebRequest(null));
Output
Basic Authentication
In the '90s, basic authentication was everywhere. Nowadays it's rare, but
you can still find it on corporate websites. It's one of the simplest forms
of authentication. The server checks the credentials in the Authorization
header sent by the client, or issues a prompt in the case of a web browser.
If the credentials are not correct, the server responds with a 401 (Unauthorized)
status.
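With an HTTP client, basic authentication just means sending an Authorization header containing "Basic " plus the base64-encoded "user:password" string. A small sketch with the JDK (the credentials are placeholders):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BasicAuthHeader {
    public static void main(String[] args) {
        String credentials = "user:password"; // placeholder credentials
        String header = "Basic " + Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
        // This value goes into the Authorization request header
        System.out.println(header);
    }
}
```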
Javascript 101
Javascript is an interpreted scripting language. It’s more and more used to
build “Web applications” and “Single Page Applications”.
The goal of this chapter is not to teach you Javascript, to be honest, I’m a
terrible Javascript developer, but I want you to understand how it is used on
the web, with some examples.
The Javascript syntax is similar to C or Java, supporting common data types,
like Boolean, Number, String, Arrays, Object… Javascript is loosely typed,
meaning there is no need to declare the data type explicitly.
Here is some code examples:
Dealing with Javascript
Jquery
jQuery²⁶ is one of the most used Javascript libraries. It's really old (the first
version was written in 2006), and it is used for lots of things such as:
• DOM manipulation
• AJAX calls
• Event handling
• Animation
• Plugins (Datepicker etc.)
Here is a jQuery version of the same apple stock code (you can note that the
jQuery version is not necessarily clearer than the vanilla Javascript one…) :
Apple stock price
<!DOCTYPE html>
<html>
<head>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.2.1/jquer\
y.min.js"></script>
<script>
function refreshAppleStock(){
$.get("https://api.iextrading.com/1.0/stock/aapl/batch?types=quot\
e,news,chart&range=1m&last=10", function(data, status) {
$('#my_cell').html('$' + data.quote.latestPrice);
});
²⁶https://jquery.com/
$(document).ready(function(){
$("#refresh").click(function(){
refreshAppleStock();
});
});
</script>
</head>
<body>
<div>
<h2>Apple stock price:</h2>
<div id="my_cell">
</div>
<button id="refresh">Refresh</button>
</div>
</body>
</html>
If you want to know more about Javascript, I suggest you this excellent book:
Eloquent Javascript²⁷
The other problem with traditional server-side rendering is that it can be
inefficient. Let's say you are browsing a table on an old website. When you
request the next page, the server is going to render the entire HTML page,
with all the assets, and send it back to your browser. With an SPA, only one
HTTP request would have been made; the server would have sent back JSON
containing the data, and the Javascript framework would have filled the HTML
template it already has with the new values!
Here is a diagram to better understand how it works:
In theory, SPAs are faster, have better scalability and lots of other benefits
compared to server-side rendering.
That’s why Javascript frameworks were created. There are lots of different
Javascript frameworks :
These frameworks are often used to create so-called “Single Page Applica-
tions”. There are lots of differences between these, but it is out of this book
scope to dive into it.
²⁸https://angularjs.org/
²⁹https://www.emberjs.com/
³⁰https://reactjs.org/
³¹https://vuejs.org/
It can be challenging to scrape these SPAs because there are often lots of
AJAX calls and websockets³² connections involved. If performance is an
issue, you should always try to reproduce the Javascript behavior yourself,
meaning manually inspecting all the network calls with your browser
inspector and replicating the AJAX calls that contain the interesting data.
So depending on what you want to do, there are several ways to scrape these
websites. For example, if you need to take a screenshot, you will need a real
browser, capable of interpreting and executing all the Javascript code. That is
what the next part is about.
Headless Chrome
We are going to introduce a new feature of Chrome: headless mode. There
was a rumor going around that Google used a special version of Chrome for
its crawling needs. I don't know if this is true, but Google launched headless
mode for Chrome with Chrome 59, several months ago.
PhantomJS was the leader in this space; it was (and still is) heavily used for
browser automation and testing. After hearing the news about Headless
Chrome, the PhantomJS maintainer announced that he was stepping down as
maintainer, because, I quote, "Google Chrome is faster and more stable than
PhantomJS […]". It looks like headless Chrome is becoming the way to go
when it comes to browser automation and dealing with Javascript-heavy
websites.
HtmlUnit, PhantomJS, and the other headless browsers are very useful tools;
the problem is that they are not as stable as Chrome, and you will sometimes
encounter Javascript errors that would not have happened with Chrome.
Prerequisites
• Google Chrome > 59
³²https://en.wikipedia.org/wiki/WebSocket
• Chromedriver³³
• Selenium
• In your pom.xml add a recent version of Selenium :
pom.xml
<dependency>
<groupId>org.seleniumhq.selenium</groupId>
<artifactId>selenium-java</artifactId>
<version>3.8.1</version>
</dependency>
If you don’t have Google Chrome installed, you can download it here³⁴ To
install Chromedriver you can use brew on MacOS :
Or download it using the link below. There are a lot of versions, I suggest you
to use the last version of Chrome and chromedriver.
Coinbase screenshot
We are going to manipulate Chrome in headless mode using the Selenium
API. The first thing we have to do is to create a WebDriver object, whose role
is similar to the WebClient object with HtmlUnit, and set the chromedriver
path and some arguments:
Chrome driver
// Init chromedriver
String chromeDriverPath = "/Path/To/Chromedriver";
System.setProperty("webdriver.chrome.driver", chromeDriverPath);
ChromeOptions options = new ChromeOptions();
options.addArguments("--headless", "--disable-gpu", "--window-size=1920,1200", "--ignore-certificate-errors");
// Optionally, point to a specific Chrome binary (must be set before creating the driver):
// options.setBinary("/Path/to/specific/version/of/Google Chrome");
WebDriver driver = new ChromeDriver(options);
If you want to learn more about the different options, here is the
Chromedriver documentation³⁷.
The next step is to perform a GET request to the Coinbase website, wait for
the page to load, and then take a screenshot.
We have done this in a previous article; here is the full code:
GDAX Screenshot example
Waiting for the page to fully load is a common problem when scraping SPAs,
and one way I like to solve it is by using the WebDriverWait object:
WebDriverWait usage
WebDriverWait wait = new WebDriverWait(driver, 20);
wait.until(ExpectedConditions.
presenceOfElementLocated(By.xpath("/path/to/element")));
This was a brief introduction to headless Chrome and Selenium, now let’s see
some common and useful Selenium objects and methods!
³⁸https://seleniumhq.github.io/selenium/docs/api/java/org/openqa/selenium/support/ui/ExpectedConditions.html
Selenium API
In the Selenium API, almost everything is based around two interfaces:
• WebDriver, which is the HTTP client
• WebElement, which represents a DOM object
The WebDriver³⁹ can be initialized with almost every browser, and with
different options (and of course, browser-specific options) such as the
window size, the log file's path, etc.
Here are some useful methods :
Method Description
driver.get(URL) performs a GET request to the specified URL
driver.getCurrentUrl() returns the current URL
driver.getPageSource() returns the full HTML code for the current page
driver.navigate().back() navigates one step back in the history (works with forward too)
driver.switchTo().frame(frameElement) switches to the specified iFrame
driver.manage().getCookies() returns all cookies (lots of other cookie-related methods exist)
driver.quit() quits the driver and closes all associated windows
driver.findElement(by) returns a WebElement located by the specified locator
The findElement() method is one of the most interesting for our scraping
needs.
You can locate elements in different ways:
• findElement(By.xpath("/xpath/expression"))
• findElement(By.className(className))
• findElement(By.cssSelector(selector))
³⁹https://seleniumhq.github.io/selenium/docs/api/java/org/openqa/selenium/WebDriver.html
Once you have a WebElement object, there are several useful methods you can
use:
Method Description
findElement(By) you can again use this method, with a relative selector
click() clicks on the element, like a button
getText() returns the inner text (meaning the text that is inside the element)
sendKeys("some string") enters some text in an input field
getAttribute("href") returns the attribute's value (in this example, the href attribute)
Infinite scroll
Infinite scroll is heavily used on social websites and news websites, or when
dealing with a lot of information. We are going to see three different ways to
scrape infinitely scrolling pages.
I’ve set up a basic infinite scroll here: Infinite Scroll⁴⁰ Basically, each time you
scroll near the bottom of the page, an AJAX call is made to an API and more
elements are added to the table.
⁴⁰https://www.javawebscrapingsandbox.com/product/infinite_scroll
Infinite table
⁴¹https://developer.mozilla.org/en-US/docs/Web/API/Window
⁴²https://developer.mozilla.org/en-US/docs/Web/API/Window/scrollTo
driver.get("https://www.javawebscrapingsandbox.com/product/infinite_scroll");
JavascriptExecutor js = (JavascriptExecutor) driver;
for(int i = 0; i < pageNumber; i++){
    js.executeScript("window.scrollTo(0, document.body.scrollHeight);");
    // There are better ways to wait, like using the WebDriverWait object
    Thread.sleep(1200);
}
List<WebElement> rows = driver.findElements(By.xpath("//tr"));
driver.quit();
Open the Chrome Dev tools and find the <script> tag that contains the
Javascript code:
Javascript code
$(document).ready(function() {
    var win = $(window);
    var page = 1;
    var apiUrl = '/product/api/' + page;

            tdName.innerText = json[i].name;
            tdUrl.innerText = json[i].url;
            tdPrice.innerText = json[i].price;
            tr.appendChild(tdName);
            tr.appendChild(tdUrl);
            tr.appendChild(tdPrice);
        }
        win.data('ajaxready', true);
        if(url !== '/product/api/1' && url !== '/product/api/2'){
            updatePage();
        }
        $('#loading').hide();
    }
});
}
drawNextLines('/product/api/1');
drawNextLines('/product/api/2');
page = 3;
apiUrl = '/product/api/3';

// need to update the "ajaxready" variable not to fire multiple ajax calls when scrolling like crazy
win.data('ajaxready', true).scroll(function() {
    // End of the document reached?
    if (win.data('ajaxready') == false) return;
    // fire the ajax call when we are about to "touch" the bottom of the page
    // no more data past 20 pages
    if (win.scrollTop() + win.height() > $(document).height() - 100 && page < 20) {
        $('#loading').show();
        drawNextLines(apiUrl);
    }
});
});
You don’t have to understand everything there, the only information that is
interesting is that each time we scroll near the bottom of the page (100 pixels
to be precise) the drawNextLines() function is called. It takes one argument, a
URL with this pattern /product/api/:id which will return 10 more rows.
Let’s say we want 50 more rows on our table. Basically we only have to make
a loop and call drawNextLines() five times. If you look closely at the Javascript
code, when the AJAX call is loading, we set the variable ajaxready to false. So
we could check the status of this variable, and wait until it is set to true.
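The arithmetic above can be sketched as a small helper (the class and method names are hypothetical; it assumes each API page returns 10 rows and that pages 1 and 2 are fetched on the initial page load, as described above):

```java
import java.util.ArrayList;
import java.util.List;

public class InfiniteScrollHelper {

    // Computes the API URLs still to fetch, given that each call to
    // /product/api/:id returns 10 rows and that /product/api/1 and
    // /product/api/2 were already called on the first page load.
    public static List<String> nextApiUrls(int extraRowsWanted) {
        List<String> urls = new ArrayList<>();
        int calls = (int) Math.ceil(extraRowsWanted / 10.0);
        for (int i = 3; i < 3 + calls; i++) {
            urls.add("/product/api/" + i);
        }
        return urls;
    }
}
```

With 50 extra rows wanted, this yields /product/api/3 through /product/api/7, matching the five calls to drawNextLines() mentioned above.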
Calling a Javascript function
driver.get("https://www.javawebscrapingsandbox.com/product/infinite_scroll");
JavascriptExecutor js = (JavascriptExecutor) driver;
// we start at i=3 because on the first load, /product/api/1 and /product/api/2 have already been called.
for(int i = 3; i < pageNumber + 3; i++){
    js.executeScript("drawNextLines('/product/api/" + i + "');");
    // read the flag through $(window): the page's "win" variable is scoped to the ready() callback
    while((Boolean) js.executeScript("return $(window).data('ajaxready');") == false){
        Thread.sleep(100);
    }
}
List<WebElement> rows = driver.findElements(By.xpath("//tr"));
We can clearly see the API URL being called, and what the response looks
like. Then we can use HtmlUnit or any other HTTP client to perform the
requests we want, and parse the JSON response, with the Jackson library for
example. Let's say we want the first 50 rows:
[
    ...
    {
        id: 33,
        name: "LG Electronics OLED65C7P 65-Inch 4K Ultra HD Smart OLED TV (2017 Model)",
        price: "2596.99",
        url: "https://www.amazon.com/gp/product/B01NAYM1TP/ref=ox_sc_sfl_title_35?ie=UTF8"
    },
    ...
]
Here is a simple way to parse this JSON array, loop over every element, and
print it to the console. In general we don't want to stop there: maybe you
want to export the data to a CSV file, or save it into a database…
Parsing the JSON response
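The original listing used Jackson; as a dependency-free sketch of the same idea, here is a hypothetical helper that pulls every "name" value out of the raw JSON string with a regular expression. For real code, prefer a proper JSON parser such as Jackson.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JsonFieldExtractor {

    // Extracts every value of the "name" field from a JSON array string.
    // Assumes double-quoted keys and values, as the API actually returns them.
    public static List<String> extractNames(String json) {
        List<String> names = new ArrayList<>();
        Matcher m = Pattern.compile("\"name\"\\s*:\\s*\"([^\"]*)\"").matcher(json);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }
}
```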
Here are some tips when working with JS-rendered web pages:
• Try to find the hidden API using the network pane in Chrome
Dev Tools.
• Try disabling Javascript in your web browser; some websites
switch to server-side rendering in this case.
• Look for a mobile version of the target website; the UI is
generally easier to scrape. You can check this using your own
phone. If it works without redirecting to a mobile URL (like
https://m.example.com or https://mobile.example.com), try to
spoof the "User-Agent" request header in your request.
• If the UI is tough to scrape, with lots of edge cases, look for
Javascript variables in the code, and access the data directly using
the Selenium Javascript executor to evaluate these variables, as
we saw earlier.
Captcha solving, PDF parsing,
and OCR
In this chapter we are going to see several things that can block you from
scraping websites or extracting information, such as captchas, data inside
PDFs, and images.
Captcha solving
Captcha stands for "Completely Automated Public Turing test to tell
Computers and Humans Apart". Captchas are used to prevent bots/scripts
from accessing and performing actions on websites or applications.
There are dozens of different captcha types, but you should have seen at least
these two:
Old Captcha
Google ReCaptcha v2
The last one is the most used captcha mechanism: Google ReCaptcha v2.
That's why we are going to see how to "break" these captchas.
The only thing the user has to do is to click inside the checkbox. The service
will then analyze lots of factors to determine if it is a real user or a bot. We
don't know exactly how this is done, Google didn't disclose it for obvious
reasons, but there has been a lot of speculation:
• Clicking behavior analysis: where did the user click? Cursor acceleration,
etc.
• Browser fingerprinting
• Click location history (do you always click straight on the center, or is it
random, like a normal user?)
• Browser history and cookies
• Browser history and cookies
For old captchas like the first one, Optical Character Recognition and recent
machine-learning frameworks offer excellent solving accuracy (sometimes
better than humans…), but for ReCaptcha v2 the easiest and most accurate
way is to use a third-party service.
Many companies offer captcha-solving APIs that use real human operators
to solve captchas. I don't recommend one in particular, but I have found
2captcha.com⁴³ easy to use, reliable, and cheap ($2.99 for 1000 captchas).
Under the hood, 2captcha and other similar APIs need the specific site-key
and the target website URL; with this information they are able to get a
human operator to solve the captcha. The solution is a token. It is this token
that interests us, and the 2captcha API will send it back. We will then need
to fill the hidden input with this token and submit the form.
The first thing you will need to do is to create an account on 2captcha.com⁴⁴
and add some funds.
You will then find your API key on the main dashboard.
As usual, I have set up an example webpage⁴⁵ with a simple form with one
input and a ReCaptcha to solve:
⁴⁴https://2captcha.com?from=6028997
⁴⁵https://www.javawebscrapingsandbox.com/captcha
Form + captcha
We are going to use Chrome in headless mode to post this form, and HtmlUnit
to make the API calls to 2captcha (we could use any other HTTP client for
this). Now let's code.
Instantiate WebDriver and WebClient
final String API_KEY = "YOUR_API_KEY";
final String API_BASE_URL = "http://2captcha.com/";
final String BASE_URL = "https://www.javawebscrapingsandbox.com/captcha";

driver.get(BASE_URL);
// the site key sits on the standard ReCaptcha div
WebElement elem = driver.findElement(By.cssSelector(".g-recaptcha"));
String siteId = "";
try {
    siteId = elem.getAttribute("data-sitekey");
} catch (Exception e) {
    System.err.println("Captcha's div cannot be found or missing attribute data-sitekey");
    e.printStackTrace();
}

String QUERY = String.format("%sin.php?key=%s&method=userrecaptcha&googlekey=%s&pageurl=%s&here=now",
        API_BASE_URL, API_KEY, siteId, BASE_URL);
Page response = client.getPage(QUERY);
String stringResponse = response.getWebResponse().getContentAsString();
String jobId = "";
if(!stringResponse.contains("OK")){
    throw new RuntimeException("Error with 2captcha API: " + stringResponse);
}
// the response is formatted as OK|<jobId>
jobId = stringResponse.split("\\|")[1];
⁴⁶https://2captcha.com/2captcha-api#solving_recaptchav2_new
Now that we have the job ID, we have to poll another API route to know
when the ReCaptcha is solved and to get the token, as explained in the
documentation. It returns CAPCHA_NOT_READY while the operator is working,
and the weirdly formatted OK|TOKEN when it is ready:
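That polling loop can be sketched as a hypothetical helper; the HTTP GET on the 2captcha res.php route is abstracted behind a Supplier so that only the retry logic is shown:

```java
import java.util.function.Supplier;

public class CaptchaPoller {

    // Polls the 2captcha response endpoint until the token is ready.
    // `fetch` abstracts the HTTP call (it returns the raw response body),
    // so this sketch contains only the retry logic.
    public static String waitForToken(Supplier<String> fetch, int maxTries, long delayMs) {
        for (int i = 0; i < maxTries; i++) {
            String response = fetch.get();
            if (response != null && response.startsWith("OK|")) {
                return response.substring(3); // strip the "OK|" prefix, keep the token
            }
            try {
                Thread.sleep(delayMs); // typically "CAPCHA_NOT_READY": wait and retry
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return null;
            }
        }
        return null; // gave up
    }
}
```

In the real code, the Supplier would perform the HTTP request on the res.php route with your API key and the job ID.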
Once we have the token, we have to make the hidden input visible, fill it,
and make it hidden again, so that we can click on the submit button:
Hidden input
// the g-recaptcha-response textarea is hidden; make it visible first
JavascriptExecutor js = (JavascriptExecutor) driver;
js.executeScript("document.getElementById('g-recaptcha-response').style.display = 'block';");
WebElement textarea = driver.findElement(By.id("g-recaptcha-response"));
textarea.sendKeys(captchaToken);
js.executeScript("document.getElementById('g-recaptcha-response').style.display = 'none';");
driver.findElement(By.id("name")).sendKeys("Kevin");
driver.findElement(By.id("submit")).click();
And that’s it :) Generally, websites don’t use ReCaptcha for each HTTP
requests, but only for suspicious ones, or for specific actions like account
creation, etc. You should always try to figure out if the website is showing you
a captcha / Recaptcha because you made too many requests with the same IP
address or the same user-agent, or maybe you made too many requests per
second.
As you can see, “Recaptcha solving” is really slow, so the best way to “solve”
this problem is by avoiding catpchas in the first place !
PDF parsing
Adobe created the Portable Document Format in the early 90s. It is still
heavily used today for cross-platform document sharing. Lots of websites use
PDF export for documents, bills, manuals… and maybe you are reading this
eBook in the PDF format. It can be useful to know how to extract pieces of
information from PDF files, and that is what we are going to see.
I made a simple page⁴⁷, with a link to a PDF invoice. The invoice looks like
this:
⁴⁷https://www.javawebscrapingsandbox.com/pdf
Invoice
We are going to see how to download this PDF and extract information from
it.
Prerequisites
We will need HtmlUnit to get the webpage and download the PDF, and the
PDFBox library to parse it.
pom.xml
<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>pdfbox</artifactId>
<version>2.0.4</version>
</dependency>
if(pdf.getWebResponse().getContentType().equals("application/pdf")){
System.out.println("Pdf downloaded");
IOUtils.copy(pdf.getWebResponse().getContentAsStream(),
new FileOutputStream("invoice.pdf"));
System.out.println("Pdf file created");
}
Anytown, State
ZIP
COMPANY NAME
We just have to loop over each line and use a regular expression with a
capturing group, like this one: "Total\\s+€\\s+(.+)", to extract the total price.
We could extract everything else we want with other regular expressions:
the email address, the postal address, the invoice number…
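For example, an email address could be pulled out with a hypothetical helper like this one (the pattern is deliberately simplified):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InvoiceFields {

    // Returns the first email address found in the extracted text,
    // or null if there is none. The pattern is simplified on purpose.
    public static String extractEmail(String text) {
        Matcher m = Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}").matcher(text);
        return m.find() ? m.group() : null;
    }
}
```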
Here is the full code:
Scraping the Invoice
PDDocument document = null;
try {
    document = PDDocument.load(new File("invoice.pdf"));
    String text = new PDFTextStripper().getText(document);
    // capture the amount that follows "Total €"
    Matcher matcher = Pattern.compile("Total\\s+€\\s+(.+)").matcher(text);
    String price = matcher.find() ? matcher.group(1) : "";
    if(!price.isEmpty()){
        System.out.println("Price found: " + price);
    }else{
        System.out.println("Price not found");
    }
} catch (IOException e) {
    e.printStackTrace();
}
There are many methods in the PDFBox library: you can work with
password-protected PDFs, extract specific text areas, and much more; here
is the documentation⁴⁸.
⁴⁸https://pdfbox.apache.org/docs/2.0.8/javadocs/
Optical Character Recognition
Installation
Installing Tesseract⁴⁹ and all its dependencies is really easy: on Linux, use
your distribution's package manager, and on macOS, use brew.
⁴⁹https://github.com/tesseract-ocr/
To call Tesseract from Java, add the tesseract-platform binding to your pom.xml:
<dependency>
<groupId>org.bytedeco.javacpp-presets</groupId>
<artifactId>tesseract-platform</artifactId>
<version>3.05.01-1.4.1</version>
</dependency>
Tesseract example
I took a screenshot of the previous PDF:
⁵⁰https://github.com/tesseract-ocr/tesseract/wiki
OCR example
// TESS_DATA_PATH points to the "tessdata" folder containing the trained models
TessBaseAPI api = new TessBaseAPI();
if (api.Init(TESS_DATA_PATH, "eng") != 0) {
    System.err.println("Could not initialize tesseract.");
    System.exit(1);
}
This was just an example of how to use Tesseract for simple OCR. I'm not an
expert on OCR and image processing, but the Tesseract wiki⁵¹ is a good place
to look for tips.
⁵¹https://github.com/tesseract-ocr/tesseract/wiki
Stay under cover
In this chapter, we are going to see how to make our bots look like humans.
For various reasons, there are sometimes anti-bot mechanisms implemented
on websites. The most obvious reason to protect sites from bots is to prevent
heavy automated traffic from impacting a website's performance. Another
reason is to stop bad behavior from bots, like spam.
There are various protection mechanisms. Sometimes your bot will be
blocked if it makes too many requests per second, hour, or day. Sometimes
there is a rate limit on the number of requests per IP address. The most
difficult protection to deal with is user behavior analysis: for example, the
website could analyze the time between requests, or whether the same IP is
making requests concurrently.
You won't necessarily need all the advice in this chapter, but it might help
you in case your bot is not working, or things don't work in your Java code
the same way they do in a real browser.
Headers
In Chapter 3 we introduced HTTP headers. Your browser systematically
includes six or seven of them, as you can see by inspecting a request in your
browser's network inspector:
Request headers
If you don’t send these headers in your requests, the target server can easily
recognize that your request is not sent from a regular web browser. If the
server has some kind of anti-bot mechanism, different things can happen: *
The HTTP response can change * Your IP address could be blocked * Captcha
* Rate limit on your requests
HtmlUnit provides a really simple way to customize our HTTP client’s headers
Init WebClient with request headers
WebClient client = new WebClient();
client.addRequestHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8");
client.addRequestHeader("Accept-Encoding", "gzip, deflate, br");
client.addRequestHeader("Accept-Language", "en-US,en;q=0.9,fr-FR;q=0.8,fr;q=0.7,la;q=0.6");
client.addRequestHeader("Connection", "keep-alive");
client.addRequestHeader("Host", "ksah.in");
client.addRequestHeader("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36");
client.addRequestHeader("Pragma", "no-cache");
And then have a little helper method that reads this file and returns a random
user agent:

public static String randomUserAgent() {
    List<String> userAgents = new ArrayList<>();
    Random rand = new Random();
    try {
        // "user_agents.txt" is assumed to hold one user-agent string per line
        userAgents = Files.readAllLines(Paths.get("user_agents.txt"));
    } catch (IOException e) {
        e.printStackTrace();
    }
    return userAgents.get(rand.nextInt(userAgents.size()));
}
Proxies
The easiest solution to hide our scrapers is to use proxies. In combination
with a random user-agent, using a proxy is a powerful way to hide our
scrapers and to scrape rate-limited web pages. Of course, it's better not to be
blocked in the first place, but sometimes websites only allow a certain
number of requests per day or hour.
In these cases, you should use a proxy. There are lots of free proxy lists; I
don't recommend using them, because they are often slow and unreliable,
and the websites offering these lists are not always transparent about where
the proxies are hosted. Sometimes the public proxy list is operated by a
legitimate company offering premium proxies, and sometimes not… What I
recommend is using a paid proxy service, or building your own.
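Once you have a pool of proxies (paid or self-built), a simple way to spread requests across them is round-robin rotation. Here is a minimal sketch; the class name and the host:port values in the test are placeholders:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class ProxyRotator {

    private final List<String> proxies; // entries like "host:port"
    private final AtomicInteger index = new AtomicInteger(0);

    public ProxyRotator(List<String> proxies) {
        this.proxies = proxies;
    }

    // Returns the proxies in round-robin order, thread-safely.
    // Math.floorMod keeps the index valid even after integer overflow.
    public String next() {
        return proxies.get(Math.floorMod(index.getAndIncrement(), proxies.size()));
    }
}
```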
Setting a proxy with HtmlUnit only takes a couple of lines, using the
ProxyConfig class on the WebClient's options.
Scrapoxy⁵² is a great open-source API that allows you to build a proxy API
on top of different cloud providers.
⁵²http://scrapoxy.io/
TOR: The Onion Router
Another option is to route your traffic through the TOR⁵³ network. You then
have to launch the TOR daemon and point the WebClient's proxy config at
the local SOCKS proxy TOR exposes (port 9050 by default).
⁵³https://www.torproject.org/
Tips
Cookies
Cookies are used for lots of reasons, as discussed earlier. If you find that the
target website responds differently to your bots, try to analyze the cookies
set by client-side Javascript code and inject them manually. You could also
use Chrome in headless mode for better cookie handling.
Timing
If you want to hide your scrapers, you have to behave like a human. Timing
is key. Humans don't mass-click links 0.2 seconds after arriving on a web
page, and they don't click a link exactly every 5 seconds either. Add some
random time between your requests to hide your scrapers.
Fast scraping is not a good practice: you will get blocked, and if you do this
on small websites it puts a lot of pressure on the website's servers. It can even
be illegal in some cases, as it can be considered an attack.
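A randomized pause between requests can be as simple as this hypothetical helper:

```java
import java.util.Random;

public class HumanDelay {

    private static final Random rand = new Random();

    // Returns a random delay in milliseconds in the range [minMs, maxMs),
    // so consecutive requests are not evenly spaced.
    public static long randomDelayMs(long minMs, long maxMs) {
        return minMs + (long) (rand.nextDouble() * (maxMs - minMs));
    }
}
```

For example, Thread.sleep(HumanDelay.randomDelayMs(2000, 7000)); pauses between 2 and 7 seconds.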
Invisible elements
Invisible elements are a technique often used to detect bots accessing and
crawling a website. Generally, one or more elements are hidden with CSS,
and some code notifies the website's server if there is a click on the element,
or a request to a hidden link. The server then blocks the bot's IP address.
A good way to avoid this trap is to use the isDisplayed() method of the
Selenium API:
Interacting with visible elements only
<form>
<input type="hidden" name="itsatrap" value="value1"/>
<input type="text" name="email"/>
<input type="submit" value="Submit"/>
</form>
Cloud scraping
Serverless
In this chapter, we are going to introduce serverless deployment for our
bots. Serverless is a term referring to the execution of code inside ephemeral
containers (Function as a Service, or FaaS). It is a hot topic in 2018: after the
"micro-service" hype, here come the "nano-services"!
Cloud functions can be triggered by different things, such as an HTTP
request, an event, or a message in a queue.
Cloud functions are a really good fit for web scraping, for many reasons.
Web scraping is I/O bound: most of the time is spent waiting for HTTP
responses, so we don't need high-end CPU servers. Cloud functions are
cheap and easy to set up, and they are a good fit for parallel computing: we
can create hundreds or thousands of functions at the same time for
large-scale scraping.
We are going to deploy a scraper as an Azure cloud function. I don't have a
preferred vendor; AWS Lambda is a great platform too. Google Cloud doesn't
support Java at the moment, only Node.js.
We are going to re-use the Hacker News scraper we built in chapter 3 and
implement a little API on top of it, so that we will be able to call this API with
a page parameter, and the function will return a JSON array of the Hacker
News items for this page number.
Prerequisites
You will need :
• JDK 8
• Maven 3+
• Azure CLI⁵⁴
⁵⁴https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest
az login
mvn archetype:generate \
-DarchetypeGroupId=com.microsoft.azure \
-DarchetypeArtifactId=azure-functions-archetype
Then Maven will ask you details about the project. The generated code is
concise and straightforward:
⁵⁵https://docs.microsoft.com/en-us/azure/azure-functions/functions-run-local#v2
⁵⁶https://azure.microsoft.com/en-us/free/
⁵⁷https://maven.apache.org/guides/introduction/introduction-to-archetypes.html
        if (name == null) {
            return request.createResponse(400, "Please pass a name on the query string or in the request body");
        } else {
            return request.createResponse(200, "Hello, " + name);
        }
    }
}
There might be some errors if you didn’t correctly install the previous
requirements.
Deploying your Azure Function is as easy as:
mvn azure-functions:deploy
Azure will create a new URL for your function each time you deploy your
app. The first invocation will be very slow; it can sometimes take up to one
minute. This "issue" is called a cold start. The first time you invoke a function,
or when you haven't called it for a "long" time (i.e. several minutes), Azure
has to:
• spin a server
• configure it
• load your function code and all the dependencies
Function hnitems
@FunctionName("hnitems")
public HttpResponseMessage<String> hnitems(
        @HttpTrigger(name = "req", methods = {"get"}, authLevel = AuthorizationLevel.ANONYMOUS) HttpRequestMessage<Optional<String>> request,
        final ExecutionContext context) {

    context.getLogger().info("Java HTTP trigger processed a request.");

    // read the "pageNumber" query parameter (null if absent)
    String pageNumber = request.getQueryParameters().get("pageNumber");

    if (pageNumber == null) {
        return request.createResponse(400, "Please pass a pageNumber on the query string");
    } else if (!StringUtils.isNumeric(pageNumber)) {
        return request.createResponse(400, "Please pass a numeric pageNumber on the query string");
    } else {
        HNScraper scraper = new HNScraper();
        String json;
        try {
            json = scraper.scrape(pageNumber);
        } catch (JsonProcessingException e) {
            e.printStackTrace();
            return request.createResponse(500, "Internal Server Error while processing HN items: ");
        }
        return request.createResponse(200, json);
    }
}
You should have your function URL in the logs. It's time to test our modified
API (replace ${function_url} with your own URL):
curl https://${function_url}/api/hnitems?pageNumber=3
[
    {
        "title": "Nvidia Can Artificially Create Slow Motion That Is Better Than a 300K FPS Camera (vice.com)",
        "url": "https://motherboard.vice.com/en_us/article/ywejmy/nvidia-ai-slow-motion-better-than-a-300000-fps-camera",
        "author": "jedberg",
        "score": 27,
        "position": 121,
        "id": 17597105
    },
    {
        "title": "Why fundraising is a terrible experience for founders: Lessons learned (kapwing.com)",
        "url": "https://www.kapwing.com/blog/the-terrible-truths-of-fundraising/",
        "author": "jenthoven",
        "score": 74,
        "position": 122,
        "id": 17594807
    },
    {
        "title": "Why No HTTPS? (whynohttps.com)",
        "url": "https://whynohttps.com",
        "author": "iafrikan",
        "score": 62,
        "position": 123,
        "id": 17599022
    },
    ...
This is it. Instead of returning the JSON array, we could store it in one of the
different database systems supported by Azure.
I suggest you experiment, especially with messaging queues. An interesting
architecture for your scraping project could be to send jobs into a message
queue, let Azure functions consume these jobs, and save the results into a
database. You can read more about this subject here⁵⁹.
The possibilities of Azure and other cloud providers like Amazon Web
Services are endless and easy to implement, especially with serverless
architectures, and I really recommend experimenting with these tools.
Conclusion
This is the end of this guide. I hope you enjoyed it. You should now be able
to write your own scrapers, inspect the DOM and network requests, deal with
Javascript, reproduce AJAX calls, beat Captchas and ReCaptcha, hide your
scrapers with different techniques, and deploy your code in the cloud!
This book will never be finished, as I get so much feedback from my readers.
There are many chapters I would like to add: more case studies, a chapter
about the legal side of web scraping, a chapter about multithreaded scraping,
etc. If enough people are interested, I may create a full online video course :)
I made a Google Form⁶⁰ to get feedback from my readers; I would really
appreciate it if you could answer it!
⁵⁹https://docs.microsoft.com/en-us/azure/azure-functions/functions-create-storage-queue-triggered-function
⁶⁰https://docs.google.com/forms/d/e/1FAIpQLSeis4z-NHXeFJfeRLQ6L82-YawEb6ABrOWsN0F4ZIsPZp6cug/viewform
⁶¹https://twitter.com/SahinKevin