WebScraping Lessons 2

Uploaded by jofil39669
Web Parsing Course: Lesson 2 - Working with Webpage Structures and Advanced HTML Parsing

Objective:

In this lesson, we dive deeper into webpage structures, learn to navigate complex HTML pages, and
practice advanced parsing techniques with BeautifulSoup. You will be able to extract specific
elements efficiently from more complicated web layouts and handle additional HTML components
like forms and tables.

Lesson Outline:

1. Understanding Webpage Structures in Depth


o HTML Hierarchy: Understanding how elements are nested within other elements
(parent-child relationships).
o DOM (Document Object Model): The tree-like structure of an HTML document.
o Common Tags: A deeper dive into common HTML tags you will encounter:
 <div>, <span>, <table>, <ul>, <li>, <form>, etc.
o Attributes and Classes: HTML elements often have attributes like class, id, and
name. These are useful when targeting specific elements for extraction.
o Dynamic Content: Some content may not be directly available in the HTML
(rendered by JavaScript), which we'll discuss later with browser automation.
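The nesting described above can be seen directly in code. This is a minimal sketch using a made-up HTML snippet (the tag names and the products id are illustrative, not from a real page):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet illustrating parent-child nesting in the DOM
html = """
<div id="products">
  <ul>
    <li class="item">Laptop</li>
    <li class="item">Phone</li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
li = soup.find('li')
print(li.parent.name)          # the direct parent tag: ul
print(li.parent.parent['id'])  # the grandparent's id attribute: products
```

Walking `.parent` upward mirrors the tree structure: each element knows exactly one parent, all the way to the document root.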
2. Navigating the HTML Tree with BeautifulSoup
o Accessing Tags and Attributes:
 How to locate specific tags (find, find_all), and filter by attributes (class,
id, etc.):

python
soup.find('div', class_='example-class')

o Navigating Parent, Child, and Sibling Elements:


 Parent: Accessing the parent element of a tag.
 Children: Accessing nested tags inside an element.
 Siblings: Moving between elements at the same level in the DOM hierarchy.

python
parent = soup.find('div', class_='container')
children = parent.find_all('div', recursive=False)  # direct children only
sibling = parent.find_next_sibling('div')
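The snippet above is a fragment, so here is a self-contained sketch of the same navigation on a made-up container (the class name and contents are illustrative):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a container with three direct <div> children
html = '<div class="container"><div>a</div><div>b</div><div>c</div></div>'
soup = BeautifulSoup(html, 'html.parser')

parent = soup.find('div', class_='container')
children = parent.find_all('div', recursive=False)  # only direct children
first = children[0]
sibling = first.find_next_sibling('div')            # the element right after 'a'

print([c.text for c in children])  # ['a', 'b', 'c']
print(sibling.text)                # 'b'
```

Note that `recursive=False` restricts the search to direct children; without it, `find_all` would also descend into any deeper nesting.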

3. Working with Complex HTML


o Nested Tags: Extracting data from deeply nested elements.
 Example: Extracting product details from an e-commerce site that has a
nested structure.
o Multiple Classes/Attributes:
 How to deal with elements that have multiple class names or specific
attributes:

python
soup.find_all('div', {'class': ['class1', 'class2']})
o Extracting Text from Tags:
 Retrieving only the text content inside tags using .text or .get_text():

python
tag = soup.find('h1')
print(tag.get_text())

4. Parsing Forms and Tables


o Forms:
 Parsing a <form> reveals its input fields, their names, and any default values.
 Extracting form fields and their values:

python
form = soup.find('form')
inputs = form.find_all('input')
for input_tag in inputs:
    print(input_tag.get('name'), input_tag.get('value'))

o Tables:
 Handling <table>, <tr>, and <td> tags to extract structured data.
 Example: Extracting table data from a webpage (e.g., a financial or academic
website):

python
table = soup.find('table')
for row in table.find_all('tr'):
    cols = row.find_all('td')
    data = [col.text for col in cols]
    print(data)

5. Practical Parsing Task: Extracting Table Data


o Scenario: Extract a table of product prices from a sample e-commerce page or
extract statistical data from a public website.
o Use BeautifulSoup to:
 Identify the table in the HTML.
 Extract the header and all rows.
 Format the extracted data into a structured format (like a list of dictionaries
or a CSV file).
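The task above can be sketched end to end. This is a minimal example assuming a made-up two-column table; in practice the header row and cell tags vary by site:

```python
import csv
from bs4 import BeautifulSoup

# Hypothetical table standing in for a scraped e-commerce page
html = """
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Laptop</td><td>999</td></tr>
  <tr><td>Phone</td><td>499</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')
rows = soup.find('table').find_all('tr')

# First row holds the headers; remaining rows hold the data
headers = [th.get_text() for th in rows[0].find_all('th')]
records = [dict(zip(headers, (td.get_text() for td in row.find_all('td'))))
           for row in rows[1:]]

# Write the structured records out as CSV
with open('products.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=headers)
    writer.writeheader()
    writer.writerows(records)
```

A list of dictionaries keyed by the header row is a convenient intermediate format: it feeds `csv.DictWriter` directly and is trivial to dump as JSON instead.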
6. Handling Multiple Pages and Pagination
o Many websites display data across multiple pages. Understanding how pagination
works is crucial.
o Example: Scraping product listings where data is spread over multiple pages with
links like ?page=2, ?page=3, etc.
 Finding Pagination Links:

python
pagination = soup.find('ul', class_='pagination')
next_page = pagination.find('a', {'rel': 'next'}).get('href')

o Automating scraping across multiple pages using a loop:

python
while next_page:
    response = requests.get(next_page)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Process data from the current page
    next_page = find_next_page(soup)
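To make the loop above concrete without hitting a live site, here is a sketch where a dictionary of made-up pages stands in for `requests.get` (the URLs and markup are purely illustrative):

```python
from bs4 import BeautifulSoup

# Simulated pages standing in for HTTP responses; the last page has no "next" link
PAGES = {
    '/items?page=1': '<ul><li>a</li></ul><a rel="next" href="/items?page=2">next</a>',
    '/items?page=2': '<ul><li>b</li></ul>',
}

def find_next_page(soup):
    """Return the href of the rel="next" link, or None on the last page."""
    link = soup.find('a', {'rel': 'next'})
    return link.get('href') if link else None

items, url = [], '/items?page=1'
while url:
    # In real scraping: soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    soup = BeautifulSoup(PAGES[url], 'html.parser')
    items.extend(li.text for li in soup.find_all('li'))
    url = find_next_page(soup)

print(items)  # ['a', 'b']
```

The loop terminates naturally when `find_next_page` returns `None`, which is the usual contract for pagination helpers.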

7. Advanced Selection Techniques with CSS Selectors


o Using CSS selectors with BeautifulSoup’s select method:
 CSS selectors provide an easy way to extract elements based on more
complex combinations of classes, IDs, and other attributes.

python
soup.select('div.class-name > a[href]')
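A runnable sketch of the selector pattern above, using hypothetical markup (the nav class and link targets are made up):

```python
from bs4 import BeautifulSoup

# Hypothetical navigation block: one link with href, one without
html = '<div class="nav"><a href="/home">Home</a><span>x</span><a>no-href</a></div>'
soup = BeautifulSoup(html, 'html.parser')

# 'div.nav > a[href]' selects direct <a> children of div.nav that carry an href
links = soup.select('div.nav > a[href]')
print([a['href'] for a in links])  # ['/home']
```

The `[href]` attribute selector filters out anchors without a target, and `>` restricts the match to direct children, two things that take noticeably more code with `find_all` alone.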

8. Dealing with Incomplete or Missing Data


o Not all web pages are perfectly structured. You may encounter missing fields or
inconsistent HTML tags.
o Strategies for handling missing or broken elements in the page:

python
data = soup.find('div', class_='product-info')
if data:
    print(data.get_text())  # extract information
else:
    print("Data missing")

9. Practical Parsing Task: Scraping Multi-Page Listings


o Scenario: Use an e-commerce or listing website (or a public dataset site) to scrape a
list of items across multiple pages.
 Write a loop that collects data from each page.
 Store the results in a structured format (JSON or CSV).
10. Homework
o Find a website with tables (e.g., sports statistics, financial reports, product listings).
o Extract the table data and save it into a CSV file.
o Identify whether the site uses pagination and attempt to scrape all pages, if
applicable.

Key Takeaways:

 Navigating and extracting data from complex HTML structures, including handling nested
elements, tables, and forms.
 Understanding how to automate scraping across multiple pages.
 Using CSS selectors for advanced element selection.

By the end of this lesson, you will be able to scrape more complex and structured data, such as
forms and tables, and navigate multi-page websites. This will prepare you to tackle even more
advanced tasks like handling JavaScript-heavy sites in the following lessons.
