Webscraper.io
A How-to Guide for Scraping DH Projects
1. Introduction to Webscraper.io
1.1. Installing Webscraper.io
1.2. Navigating to Webscraper.io
2. Creating a Sitemap
2.1. Sitemap Menu
2.2. Importing a Sitemap
2.3. Creating a Blank Sitemap
2.4. Editing Project Metadata
3. Selector Graph
4. Creating a Selector
5. Scraping a Website
6. Browsing Scraped Data
7. Exporting Sitemaps
8. Exporting Data
Association of Research Libraries 21 Dupont Circle NW, Suite 800, Washington, DC 20036
(202) 296-2296 | [Link]
1. Introduction to Webscraper.io
Webscraper.io is a free extension for the Google Chrome web browser with which users can extract information from any public website using HTML and CSS and export the data as a Comma-Separated Values (CSV) file, which can be opened in spreadsheet software like Excel or Google Sheets. The scraper uses the developer tools menu built into Chrome (Chrome DevTools) to select the different elements of a website, including links, tables, and the HTML code itself. With developer tools, users can inspect a web page to see the code that generates everything on the page, from text to images to the layout. Webscraper.io uses this code either to extract information or to navigate through the page. This is helpful for users who have no other way to extract important information from websites. Be aware of any copyright information listed on a website, and consult a copyright lawyer or professional to ensure the information can legally be scraped and shared.
This guide was developed alongside a project for extracting information from websites built on content management systems like Omeka in order to make them discoverable. We focused on Omeka because it has been widely adopted by the Digital Humanities community and presents information in a range of ways, including titles and descriptions of each project as well as the collections and exhibits held within them. Our sitemaps also extract any contributors and organizations listed. The screenshots used in this documentation are taken from Colored Conventions, which is published under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license. Colored Conventions is a straightforward website with little complex coding, which allows Webscraper.io to function at its best.
1.1. Installing Webscraper.io
Type ‘webscraper.io’ into your URL bar to navigate to the scraper’s website. The site has a wealth of documentation as well as a very active forum, and Webscraper.io regularly updates both with information that can help resolve specific issues. To install the extension itself, click on the blue ‘Download Free on Chrome Store’ button.
Figure 1 - Webscraper.io homepage
Click on the green ‘Add to Chrome’ button at the top right corner of this new page to install the extension. If the installation is successful, the button will gray out and read ‘Added to Chrome.’
Figure 2 - Click the ‘Add to Chrome’ button to install Webscraper.io
1.2. Navigating to Webscraper.io
The scraper can be found in the Developer tools panel. The quickest way to open it is to press the F12 key on either a Mac or PC. Alternatively, click on the three vertical dots in the upper right corner of the window to open the browser menu, the same menu that opens a new tab or window as well as the history or print panels. Hover the mouse over ‘More tools,’ roughly two-thirds of the way down, to show a sub-menu. This sub-menu has even more options; click ‘Developer tools’ at the bottom of this new menu to open the panel. See Figure 3 below.
Figure 3 - Opening ‘Developer tools’
This opens a new panel at the bottom of the browser. Web Scraper is the last tab in this panel, as you can see in Figure 4 below.
Figure 4 - ‘Web Scraper’ panel
The first window that appears when navigating to Web Scraper is the Sitemap panel (See Creating a Sitemap). A sitemap organizes all the information required for scraping a particular website. The panel will be blank at install, but once you create sitemaps, they will appear here. The first column lists the ID, or name, of each sitemap; the second column lists the URL, or web address, of that sitemap’s first page.
2. Creating a Sitemap
2.1. Sitemap Menu
Webscraper.io automatically opens to the Sitemap menu, which lists all of the user-created sitemaps in the scraper. Here, users can see each sitemap alongside its starting URL, and they also have the option to delete sitemaps. Delete with care: sitemaps cannot be recovered unless exported copies are saved elsewhere (See Exporting Sitemaps). Click a sitemap’s title or URL to open it.
Figure 5 - Sitemap menu
Sitemaps organize all the information about scraping a particular website in one location. They house the various selectors (See Creating a Selector) and tell the web scraper each sitemap’s title and starting URL. To create a sitemap, click on the ‘Create new Sitemap’ button. You can then either import a previously built sitemap or create a blank one.
Figure 6 - ‘Create Sitemap’
2.2. Importing a Sitemap
A user who has already created a Webscraper.io sitemap has the option to export and share that sitemap with other users, who can import it into their own web scrapers (See Exporting Sitemaps). The ‘Import Sitemap’ button creates a sitemap from such an export, which can then be manipulated like any other. Importing a sitemap requires the JavaScript Object Notation (JSON) that another user’s instance of Webscraper.io generated. Clicking on the ‘Import Sitemap’ button brings up two text entry fields. Copy and paste the JSON, which is formatted in a particular way, into the larger of the two. In the second box, the user can rename the sitemap something distinct from the name in the imported JSON code to ensure that there is no duplicate sitemap within Webscraper.io. For group projects, it also helps to keep track of the date a sitemap was imported, or who worked on it, by adding the relevant information to the end of the title.
Figure 7 - Import a sitemap
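For reference, an exported sitemap is a single block of JSON. The Python snippet below sketches its general shape; the field names follow the pattern of Webscraper.io exports, but the sitemap name, URL, and selector shown are illustrative assumptions, so always start from a real export rather than typing one by hand.

```python
import json

# A minimal, illustrative example of the general shape of an exported
# sitemap: an identifier, one or more start URLs, and a list of selectors.
# Treat this as a sketch of the format, not an authoritative schema.
sitemap_json = """
{
  "_id": "example-project",
  "startUrl": ["https://example.com/"],
  "selectors": [
    {
      "id": "project_title",
      "type": "SelectorText",
      "parentSelectors": ["_root"],
      "selector": "h1",
      "multiple": false,
      "regex": "",
      "delay": 0
    }
  ]
}
"""

sitemap = json.loads(sitemap_json)
print(sitemap["_id"])             # the sitemap name
print(len(sitemap["selectors"]))  # number of selectors defined
```

Pasting a block like this into the larger text field, and a new name into the smaller one, is all an import requires.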
2.3. Creating a Blank Sitemap
The ‘Create Sitemap’ button opens a window similar to the one opened by the ‘Import Sitemap’ button. The difference is that there is no previous information; the new sitemap will contain no scraping instructions. The user creates a new sitemap at the beginning of any project in order to build the selectors that will extract information from a website. This requires a sitemap name and the URL for a website, usually the homepage. The sitemap name follows a few rules: it cannot contain capital letters, it must start with a letter, and only a limited set of special characters is accepted. It may be helpful to copy and paste the URL into the ‘Start URL’ field to avoid errors.
Figure 8 - Creating a blank sitemap
Once the title and URL are entered, click on the ‘Create Sitemap’ button to add the sitemap to the web scraper.
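The naming rules above can be checked with a short pattern. The exact character set the extension accepts is not documented in this guide, so the Python sketch below is an approximation: it allows lowercase letters, digits, hyphens, and underscores, and requires a leading letter.

```python
import re

# Approximate check for the sitemap-name rules described above:
# must start with a letter, no capital letters, and only a limited
# set of special characters (assumed here: digits, hyphen, underscore).
# The extension's exact rules may differ; treat this as a sanity check.
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9_-]*$")

def is_valid_sitemap_name(name: str) -> bool:
    return bool(NAME_PATTERN.match(name))

print(is_valid_sitemap_name("colored-conventions"))  # True
print(is_valid_sitemap_name("Colored Conventions"))  # False: capitals and a space
```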
2.4. Editing Project Metadata
If the sitemap name or start URL ever needs to be changed, users can do so in the ‘Edit metadata’ panel, whether to fix an error or because the work belongs to a larger project outside the web scraper that mandates a change. The sitemap name is almost always the field that needs changing, not the start URL.
Figure 9 - ‘Edit metadata’ panel
Fields are changed in the same way as when creating the sitemap. Be aware that changing the start URL can affect prebuilt selectors in unintended ways, especially those that select unique information. Since selectors rely only on the HTML, any selector built for the homepage will look for that code. If the homepage changes, those selectors will look for code that may not exist on the new page and will return ‘null’ in the scraped data. The scraper may also extract the wrong information: the HTML at the start URL may be unchanged while its contents have changed, which can lead to confusion when reviewing the scraped data. It is wise to double-check that selectors still act as expected after changing the start URL.
Figure 10 - Metadata fields
3. Selector Graph
The selector graph is a visual aid that shows the sitemap’s hierarchy of selectors: which selectors are linked to the _root (homepage), and then which selectors are attached to those. This repeats until all selectors are visible, helping users understand where each selector sits in relation to the others. As you can see in Figure 11, the homepage has a number of selectors attached to it, including project_title, PrincipalInvestigatorContact, and about. Selectors beyond the first level, such as those attached to the about selector, show that there is information to scrape beyond what is selected on the first page.
Figure 11 - Selector graph
4. Creating a Selector
Users build sitemaps with selectors. Selectors tell Webscraper.io what to do with each element on the website, whether extracting a paragraph of text, clicking on a link, or scrolling down the page. Selectors are essentially the instruction manual for the web scraper: Webscraper.io will only do what the selectors within a sitemap tell it to do.
Users can fill in a number of fields for each selector, including ID, Type, Selector, the Multiple checkbox, Regex, Delay, and Parent Selectors (See Figure 13). The ID, short for identifier, is the title or label for the selector. It appears in the selector menu and the selector graph, and it will also become the column header in the exported CSV file. The most common IDs for DH projects are titles and descriptions.
The Type menu tells Webscraper.io what kind of information is being selected and how to handle it, which helps the scraper organize data or navigate the website. The most common types are Text, which extracts text, and Link, which extracts both the link text and its URL. A Link selector also tells Webscraper.io that there is information on the linked page, which allows users to add selectors for scraping those new pages. Other types, which we did not test, include the HTML and Element selectors.
The Selector field is the most important field other than the ID. It is where users specify the elements on the web page that they want to capture.
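To make the idea concrete, the following standard-library Python sketch imitates what a Text selector does: find every element matching a selection (simplified here to a tag name rather than a full CSS selector) and collect its text. This is an illustration of the concept, not the extension’s actual code.

```python
from html.parser import HTMLParser

# Simplified illustration of a Text selector: match elements by tag
# name (a stand-in for a real CSS selector) and collect their text.
class TextCollector(HTMLParser):
    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.inside = False
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.inside = False

    def handle_data(self, data):
        if self.inside and data.strip():
            self.results.append(data.strip())

page = "<html><body><h1>Colored Conventions</h1><p>An Omeka project.</p></body></html>"
collector = TextCollector("h1")
collector.feed(page)
print(collector.results)  # ['Colored Conventions']
```

A selector with Multiple checked would simply keep every match rather than the first, which is what the `results` list already does here.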
Once the sitemap is open, click on the ‘Add new selector’ button to start building the sitemap
(See Figure 12). This will open the Selector panel (See Figure 13).
Figure 12 - ‘Add new selector’
Figure 13 - New selector panel
Type the name of the selector in the ‘Selector ID’ text box. There are no rules governing the ID, but if the web scraping is part of a larger project, we suggest using the vocabulary established for that project.
Clicking on the ‘Select’ button makes a green highlight appear around the different HTML elements on the web page. It highlights both HTML tags, like heading or p tags, and CSS div and container elements. Clicking on an element adds its code to the selector bar, and the green highlight turns red once an element is selected. The checkbox in the select bar allows users to select multiple types of tags, which is useful for keeping a title with its description. The ‘Done selecting’ button adds the selections to the Selector text box, where users can also edit the text if necessary.
The ‘Element preview’ button puts a red highlight around all the elements the selector code matches, which helps ensure that everything intended is selected. By contrast, the ‘Data preview’ button opens a pop-up window with a snapshot of the data this selector will extract when Webscraper.io scrapes the sitemap.
Figure 14 - Select HTML element
The Multiple checkbox tells Webscraper.io to extract more than one of the selected elements. This is helpful when there are lists or navigation links with more than one of the same tag on the page.
We did not test the Regex box during this project; it applies a regular expression to the scraped data so that only matching text is exported. We also did not test the Delay field, which tells the scraper to let the web page load for a given amount of time before running the selector.
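As an illustration of what a regular expression can do with scraped text, the Python example below keeps only a four-digit year from a longer string. The extension applies the pattern itself during export; this sketch just shows the idea, and the sample text is invented.

```python
import re

# A regular expression keeps only the portion of scraped text that
# matches a pattern; here, a four-digit year from a longer string.
scraped = "Proceedings of the 1853 Colored National Convention"
match = re.search(r"\b\d{4}\b", scraped)
year = match.group(0) if match else None
print(year)  # 1853
```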
The Parent Selectors field places the selector in the correct spot in the sitemap hierarchy, telling Webscraper.io whether the current selector extracts information from the homepage or from one of the pages another selector links to. The list is organized by hierarchy, as seen in the selector graph, with the homepage listed first as ‘_root,’ followed by the selectors on that page. Those further down the list are selectors that usually link from the homepage to other pages on the website.
Figure 15 - ‘Parent Selectors’
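The parent relationships just described form a tree rooted at ‘_root.’ A minimal Python sketch, with illustrative selector names echoing Figure 11:

```python
# Sketch of the selector hierarchy as a tree rooted at '_root'.
# Selector names are illustrative, not from a real sitemap.
parents = {
    "project_title": "_root",
    "about": "_root",
    "about_description": "about",  # lives on the page the 'about' link opens
}

def path_to_root(selector):
    """Walk a selector's parents back up to _root."""
    chain = [selector]
    while chain[-1] != "_root":
        chain.append(parents[chain[-1]])
    return " -> ".join(reversed(chain))

print(path_to_root("about_description"))  # _root -> about -> about_description
```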
Ensure that everything is correct before saving your selector, so that the wrong parent is not selected, or no parent at all. If the user does not highlight a Parent selector before saving, the selector disappears altogether even though Webscraper.io thinks it exists: the ID is in use but does not appear anywhere in the sitemap or in the exports. There is no way to recover such lost selectors, which is a bug in the software of which Webscraper.io is aware. A mislabeled selector, by contrast, will still appear, but it must be found manually and edited to point to the correct Parent selector. The best way to find this type of error is to use the selector graph, which shows all selectors except the lost instance just discussed. Once found, users can open the selector in the selector menu and change its Parent selector accordingly.
5. Scraping a Website
Scraping a website has three main steps:
1. Creating a sitemap and selectors
2. Extracting information
3. Exporting information
This section discusses the second step. (See Sections 1 to 4 for step one and Sections 6 to 8 for step three.) Everything up to this point has set up Webscraper.io to perform a scrape: creating a sitemap and selectors tells the system what to do during the scraping process. The Scrape panel is where users tell Webscraper.io to run through all the selectors and perform the actions set within them; this is when the data is actually extracted from the website. The scraper uses the information pulled here to generate previews and export files.
Figure 16 - ‘Scrape’ button
Users have the option to set either a request interval or a page load delay for the entire scraping process (See Creating a Selector). Both options adjust the scraper’s timing so that websites can finish loading information before the scraper begins extracting it. The times are in milliseconds, with a default of 2000; anything shorter may mean a page has not finished loading its information before scraping begins. Both options add time for pages that carry a lot of information or contain elements that are slow to load. Once the preferred time is entered, click the ‘Start scraping’ button.
Figure 17 - Scrape panel
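The same idea applies to any hand-rolled scraping loop: pause between pages so each one can finish loading and the server is not overloaded. A minimal Python sketch with placeholder URLs (Webscraper.io handles this for you; note its fields take milliseconds, so the 2000 ms default is 2 seconds here):

```python
import time

# Illustration of a request interval: pause between page fetches so
# each page can load fully before the next request goes out.
urls = ["https://example.com/page1", "https://example.com/page2"]
request_interval = 2.0  # seconds; mirrors the 2000 ms default

for url in urls:
    # a real scraper would fetch and parse the page here
    print(f"scraping {url}")
    time.sleep(request_interval)
```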
Once started, a scrape opens a new browser window, loads each web page that a selector has been directed to, and cycles through them. There is no indication that information is being extracted, but windows will open and close as the scraper works through the sitemap.
When the scrape is finished, a pop-up window appears in the bottom right corner of the screen stating that the scrape is done.
Figure 18 - Scrape window
The web scraper automatically opens the Browse panel when it is finished. Clicking the ‘Refresh’ button shows the data preview.
Figure 19 - Finished scrape
6. Browsing Scraped Data
Any scraped data can be viewed by accessing the Browse panel. The scraper also
automatically redirects to this panel when a scrape is finished.
Figure 20 - Browse data
If no data appears, click ‘Refresh.’
Figure 21 - Refresh data
Webscraper.io lays the data out as a spreadsheet and provides a preview prior to downloading the CSV file (See Figure 22). This helps users ensure all the data is present and accounted for, including the information captured by the various HTML and CSS selections across all the selectors in the sitemap. Note that each selector ID is now a column header. The exported CSV file will structure the data the same way as the preview.
Figure 22 - Preview of data
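Because each selector ID becomes a column header, the exported file is straightforward to process programmatically. The Python sketch below reads rows keyed by selector ID; the column names and data are invented stand-ins for a real export:

```python
import csv
import io

# Reading an exported CSV: each selector ID is a column header, so
# rows can be read as dictionaries keyed by selector ID.
# io.StringIO stands in for opening the real downloaded file.
exported = io.StringIO(
    "project_title,about\n"
    "Colored Conventions,Bringing nineteenth-century Black organizing to digital life\n"
)

for row in csv.DictReader(exported):
    print(row["project_title"])
```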
7. Exporting Sitemaps
Any sitemap with information can be exported using the ‘Export Sitemap’ panel. Exporting a sitemap includes all of its information except the scraped data (See Exporting Data): the sitemap name, the starting URL, and all of the selectors created in the sitemap.
Figure 23 - ‘Export Sitemap’ button
The sitemap export generates JSON code in a box that opens when users access the panel. The safest way to copy the code is to click within the box and press CTRL+A to select all the text, then copy it with CTRL+C or by right-clicking and selecting ‘Copy.’ The code can then be pasted into a word processor and saved as a text file, or into an email for sharing. Note that any later changes to the sitemap in Webscraper.io will also change this export, so previously saved copies will no longer be accurate.
Figure 24 - Export sitemap panel
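Before saving or emailing the copied code, it can be worth confirming that the paste is still valid JSON, since a copy that drops even one character will fail to import. A quick Python check:

```python
import json

# Quick validity check for a copied sitemap export: if the paste lost
# or gained characters, json.loads raises an error, and it is better
# to find that out before sharing the file.
def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

print(is_valid_json('{"_id": "example"}'))  # True
print(is_valid_json('{"_id": "example"'))   # False: truncated paste
```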
8. Exporting Data
Webscraper.io exports scraped data through the ‘Export data as CSV’ panel. This is different from exporting a sitemap (See Exporting Sitemaps): this panel downloads a CSV file to the user’s computer. Users must already have scraped the website to extract the information (See Scraping a Website). This is typically the last step in a scraping project, as the CSV file is the final output of Webscraper.io.
Figure 25 - ‘Export data as CSV’ panel
A downloadable file is generated as soon as a user enters this panel, and a blue ‘Download now’ link appears when the file is ready. Once clicked, the file downloads to the location set in the browser settings. A pop-up box also appears at the bottom of the page, which users can use to open the file directly.
Figure 26 - Export data as CSV
CSV files should be opened in spreadsheet software like Excel or Google Sheets rather than a word processor. Users can also convert them to another file type fairly easily.
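As one example of such a conversion, the Python sketch below turns CSV rows into JSON using only the standard library; the in-memory data and column names stand in for a real export file:

```python
import csv
import io
import json

# Converting an exported CSV to JSON: read each row as a dictionary
# and dump the list of rows. In practice you would open the real
# downloaded file; io.StringIO stands in for it here.
exported = io.StringIO("project_title,about\nColored Conventions,An Omeka project\n")
rows = list(csv.DictReader(exported))
print(json.dumps(rows))
```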