In this section, you'll learn more about getting information from the web. You've worked with API responses in a previous module of this course, and you'll revisit them here when you use Python's json module from the standard library to process the structured JSON data that many APIs return.
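As a quick refresher, here's a minimal sketch of how the json module turns JSON text into Python data structures. The recipe data in the string below is made up for illustration:

```python
import json

# A made-up JSON string, similar in shape to what an API might return.
raw = '{"title": "Tomato Soup", "servings": 4, "ingredients": ["tomato", "basil"]}'

# json.loads() parses JSON text into Python objects:
# JSON objects become dicts, arrays become lists, strings stay strings.
recipe = json.loads(raw)

print(recipe["title"])        # Tomato Soup
print(recipe["ingredients"])  # ['tomato', 'basil']
```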
When to Use Web Scraping
But what can you do when there's no API for some information that you want to access?
Just like you can access any information on the Internet through your web browser, you can also access that information using Python through a process called web scraping.
Working through this section will give you an opportunity to keep practicing OOP concepts as you get to know the beautifulsoup4 package and use it as part of your web scraper to extract information from the HTML on websites.
Because the Internet is messy, web scraping is often messy, too. This means you'll get a lot of chances to run into errors and put your exception-handling skills into practice as well.
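To give you an idea of what that looks like in practice, here's a minimal, hypothetical sketch of handling the errors that commonly come up when requesting and parsing a page. The URL is a placeholder, not the recipe collection's real address:

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL for illustration only.
URL = "https://example.com/recipes"

try:
    response = requests.get(URL, timeout=10)
    response.raise_for_status()  # Raise an exception for 4xx/5xx status codes.
except requests.exceptions.RequestException as error:
    print(f"The request failed: {error}")
else:
    soup = BeautifulSoup(response.text, "html.parser")
    # .find() returns None when no matching element exists, so check before using it.
    heading = soup.find("h1")
    if heading is None:
        print("Couldn't find a heading on this page.")
    else:
        print(heading.get_text(strip=True))
```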
In some of the previous sections of this module, you've spent a lot of time thinking about food ingredients and soups, but you haven't seen many good recipes yet! If only there were a collection of recipes that you could access to get ideas for what to cook next...
CodingNomads Recipe Collection
CodingNomads' chef section has you covered! This Recipe Collection contains a decent number of recipes gathered from the r/recipes subreddit:
Head over to the Recipe Collection page and take a look around. In this section, you'll learn how to scrape the information from this page to get access to the recipes posted there.
Maybe you know the trope of a scrappy recipe book written by many people over many generations, with loose pages sticking out here and there: a collection of instructions for making good food, combining the wisdom of many into one resource.
That's what you'll work on creating here, only it'll be digital, and you'll source the recipe information from the provided recipe collection.
Info: You'll learn how to scrape the information from a page we set up explicitly to allow you to scrape it. Reddit provides access to their public information through an API, but their Terms of Service ask you not to scrape their site. Python web scraping gives you the power to access a lot of content on the internet, but this power also comes with the responsibility to use it wisely and respectfully. For fair use, you'll always want to respect the preferences that a website states in regard to scraping.
Instructions
The instructions in this course section won't take you all the way to a polished collection of delicious recipes, but they'll show you the basics of how to scrape the web for information using Python, requests, and beautifulsoup4.
As always, you're encouraged to keep going. By the end of this section, you'll know how to get information from the CodingNomads recipe collection, a page that displays information without offering an API.
This opens up the possibility to do more! For example, you could combine the web scraping code you'll write with the food-related classes that you built earlier in this module. Then, you could create a CLI that lets you enter a list of ingredient names, and your program could search the text you scraped from the recipe collection for recipes that you can cook with those ingredients.
Or you could ditch the topic of recipes entirely and use your Python web scraping skills to gather any other type of information from the internet.
How to Web Scrape
This section will focus on introducing you to the general process that you should follow for every web scraping task:
- Inspect
- Scrape
- Parse
Whichever site you want to scrape and whatever information you're looking for, these steps will help you get there. Keep in mind, though, that websites constantly change and evolve, and that there are so many different sites on the internet that no single program could scrape every type of information consistently and reliably.
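As a rough preview, here's a minimal sketch of what the scrape and parse steps can look like with requests and beautifulsoup4. The URL and the assumption that recipe titles sit in <h2> tags are made up for illustration:

```python
import requests
from bs4 import BeautifulSoup

# 1. Inspect: study the page in your browser's developer tools first,
#    so you know which HTML elements hold the information you want.

# 2. Scrape: download the page's HTML. This URL is a placeholder.
response = requests.get("https://example.com/recipes", timeout=10)
html = response.text

# 3. Parse: turn the HTML into a searchable BeautifulSoup object.
soup = BeautifulSoup(html, "html.parser")

# Assuming the recipe titles live in <h2> tags, collect their text.
titles = [tag.get_text(strip=True) for tag in soup.find_all("h2")]
print(titles)
```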
Note: Always be respectful when scraping data from websites! Some websites don't mind scrapers, but others don't like it when automatic scrapers gather data without their permission. You probably won't run into any problems if you're scraping a page respectfully for educational purposes. You should still take a look at the Terms of Service of a page before you start working on a web scraper project. You can learn more about the legal aspects of web scraping in Legal Perspectives on Scraping Data From The Modern Web.
Web scraping involves building programs that break often. That's in the nature of the task: you're interfacing with several different technologies, and things will get messy and frustrating at times.
However, if you take the general steps to heart and keep practicing, you'll get quicker and quicker at writing scripts that gather information from the internet. That opens up many doors for building programs and analyzing data.
To warm up, you'll start by revisiting API calls and handling JSON responses.
Additional Resources
- Real Python: Beautiful Soup: Build a Web Scraper With Python
- Requests Documentation: Requests: HTTP for Humans™
- Crummy: Beautiful Soup Documentation
Summary: What Is a Web Scraper?
- When you're web scraping with Python, the data you fetch was made to be read by humans and is therefore messier than the data you'd get from an API
- When a website's structure changes, your web scraper often needs to adapt as well
- Python web scraping gives you access to a large amount of data on the internet
- Here, you will web scrape using Python, requests, and beautifulsoup4
Main Steps of a Web Scraper
- Inspect
- Scrape
- Parse