
4) Web Scraping Lesson

Introduction to BeautifulSoup

5 min to complete · By Martin Breuss

You've learned about the structure of the website you want to extract information from, you've fetched the HTML content from the Internet, and now you're left with a messy HTML soup. This is where the crucial step of parsing that information starts. You need to extract the information you're interested in from the string of data you received.

The Beautiful Soup package can be an invaluable help in doing this. You can use it to create a BeautifulSoup() object from the str containing the HTML that you received from your HTTP request.

Why Use Python BeautifulSoup

BeautifulSoup() objects have all sorts of useful methods and attributes that you can use to spoon through your data soup and pick out the pieces of information that you actually want to consume.

Install the BeautifulSoup Package

To get started, you first need to install the BeautifulSoup package inside your virtual environment:

python3 -m pip install beautifulsoup4

Import Python BeautifulSoup

Once the installation is complete, you can import the BeautifulSoup() class from the package inside your Python script:

from bs4 import BeautifulSoup

How to Use Python BeautifulSoup

Next, you can pass the HTML str that you ended up with in the previous lesson to the class constructor, together with the name of the parser you want to use, to create a BeautifulSoup() object of your specific web page that you can then interact with:

soup = BeautifulSoup(page.text, "html.parser")

Passing "html.parser" explicitly selects Python's built-in HTML parser; if you omit it, Beautiful Soup picks one for you and emits a warning.
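If you don't have a live HTTP response at hand, you can try the same constructor call on a small inline HTML string. This is a minimal sketch, and the recipe links in it are made-up placeholders rather than content from the site you're scraping:

```python
from bs4 import BeautifulSoup

# A small inline HTML document standing in for a fetched page
html = """
<html><body>
<a href="/recipes/pancakes">Pancakes</a>
<a href="/recipes/waffles">Waffles</a>
</body></html>
"""

# "html.parser" is Python's built-in parser, so no extra install is needed
soup = BeautifulSoup(html, "html.parser")

# .find() returns the first matching element
print(soup.find("a")["href"])
```

When scraping a real page, you'd replace the html string with the text of your HTTP response, as shown above.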

Example Method: find_all

This object gives you access to many convenient methods to traverse the HTML string of your webpage. For example, you can quickly find all links that are on the page:

links = soup.find_all("a")

for link in links:
    print(link["href"])

In this code snippet, you used the .find_all() method and passed it the name of an HTML element ("a") as its only argument. It returned a list of all HTML link elements on the page, which you saved to the links variable.

Then, you used a for loop to iterate over each link element, which has the form <a href="URL">link text</a>, and picked out the URL by accessing the href attribute through a dictionary-like syntax (link["href"]).

With this short code snippet, you were able to find the locations of all recipe pages and save them to a variable that you can continue working with.
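Keep in mind that not every <a> element is guaranteed to define an href attribute, so link["href"] can raise a KeyError. A minimal sketch of a more defensive loop, using find_all()'s href=True filter from Beautiful Soup's API on a made-up HTML snippet:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second <a> element has no href attribute
html = '<a href="/one">One</a> <a name="anchor-only">No URL</a>'
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only those <a> elements that actually define an href
for link in soup.find_all("a", href=True):
    print(link["href"])
```

Alternatively, link.get("href") returns None instead of raising an error when the attribute is missing.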


Tasks

  • Replace the code above with a Python list comprehension that collects only the link URLs in the links variable. You should be able to do this using a single line of code.

Once you've collected all the links to the individual recipes, you're done with the information you want from the main page. In the next lesson, you'll scrape each of the detailed recipe pages and use Beautiful Soup to extract the information you're interested in.


Additional Resources

Summary: What is BeautifulSoup

  • BeautifulSoup allows you to parse the HTML of a page
  • BeautifulSoup has many convenient methods to parse the text content and pick out specific pieces of information
  • The find_all() method can pick out all links from a soup object