{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Web scraping the President's lies in 16 lines of Python\n", "\n", "*Created by Kevin Markham of [Data School](http://www.dataschool.io/). Hosted on [GitHub](https://github.com/justmarkham/trump-lies).*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "This is an introductory tutorial on web scraping in Python. All that is required to follow along is a basic understanding of the Python programming language.\n", "\n", "By the end of this tutorial, you will be able to scrape data from a static web page using the **requests** and **Beautiful Soup** libraries, and export that data into a structured text file using the **pandas** library." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Outline\n", "\n", "- What is web scraping?\n", "- Examining the New York Times article\n", "    - Examining the HTML\n", "    - Fact 1: HTML consists of tags\n", "    - Fact 2: Tags can have attributes\n", "    - Fact 3: Tags can be nested\n", "- Reading the web page into Python\n", "- Parsing the HTML using Beautiful Soup\n", "    - Collecting all of the records\n", "    - Extracting the date\n", "    - Extracting the lie\n", "    - Extracting the explanation\n", "    - Extracting the URL\n", "    - Recap: Beautiful Soup methods and attributes\n", "- Building the dataset\n", "    - Applying a tabular data structure\n", "    - Exporting the dataset to a CSV file\n", "- Summary: 16 lines of Python code\n", "    - Appendix A: Web scraping advice\n", "    - Appendix B: Web scraping resources\n", "    - Appendix C: Alternative syntax for Beautiful Soup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What is web scraping?\n", "\n", "On July 21, 2017, the New York Times updated an opinion article called [Trump's 
Lies](https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html), detailing every public lie the President has told since taking office. Because this is a newspaper, the information was (of course) published as a block of text. This is a great format for human consumption, but it can't easily be understood by a computer. **In this tutorial, we'll extract the President's lies from the New York Times article and store them in a structured dataset.**\n", "\n", "This is a common scenario: You find a web page that contains data you want to analyze, but it's not presented in a format that you can easily download and read into your favorite data analysis tool. You might imagine manually copying and pasting the data into a spreadsheet, but in most cases, that is way too time consuming. A technique called **web scraping** is a useful way to automate this process.\n", "\n", "What is web scraping? It's the process of extracting information from a web page **by taking advantage of patterns** in the web page's underlying code. Let's start looking for these patterns!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Examining the New York Times article\n", "\n", "Here's the way the article presented the information:\n", "\n", "\n", "\n", "When converting this into a dataset, **you can think of each lie as a \"record\" with four fields:**\n", "\n", "1. The date of the lie.\n", "2. The lie itself (as a quotation).\n", "3. The writer's brief explanation of why it was a lie.\n", "4. 
The URL of an article that substantiates the claim that it was a lie.\n", "\n", "Importantly, those fields have different formatting, which is consistent throughout the article: the date is bold red text, the lie is \"regular\" text, the explanation is gray italics text, and the URL is linked from the gray italics text.\n", "\n", "**Why does the formatting matter?** Because it's very likely that the code underlying the web page \"tags\" those fields differently, and we can take advantage of that pattern when scraping the page. Let's take a look at the source code, known as HTML:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Examining the HTML\n", "\n", "To view the HTML code that generates a web page, you right click on it and select \"View Page Source\" in Chrome or Firefox, \"View Source\" in Internet Explorer, or \"Show Page Source\" in Safari. (If that option doesn't appear in Safari, just open Safari Preferences, select the Advanced tab, and check \"Show Develop menu in menu bar\".)\n", "\n", "Here are the first few lines you will see if you view the source of the New York Times article:\n", "\n", "\n", "\n", "Let's locate the **first lie** by searching the HTML for the text \"iraq\":\n", "\n", "\n", "\n", "Thankfully, you only have to understand **three basic facts** about HTML in order to get started with web scraping!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fact 1: HTML consists of tags\n", "\n", "You can see that the HTML contains the article text, along with \"tags\" (specified using angle brackets) that \"mark up\" the text. (\"HTML\" stands for Hyper Text Markup Language.)\n", "\n", "For example, one tag is `<strong>`, which means \"use bold formatting\". There is a `<strong>` tag before \"Jan. 21\" and a `</strong>` tag after it. 
The first is an \"opening tag\" and the second is a \"closing tag\" (denoted by the `/`), which indicates to the web browser **where to start and stop applying the formatting.** In other words, this tag tells the web browser to make the text \"Jan. 21\" bold. (Don't worry about the `&nbsp;` - we'll deal with that later.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fact 2: Tags can have attributes\n", "\n", "HTML tags can have \"attributes\", which are specified in the opening tag. For example, `<span class=\"short-desc\">` indicates that this particular `<span>` tag has a `class` attribute with a value of `short-desc`.\n", "\n", "For the purpose of web scraping, **you don't actually need to understand** the meaning of `<span>`, `class`, or `short-desc`. Instead, you just need to recognize that tags can have attributes, and that they are specified in this particular way." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fact 3: Tags can be nested\n", "\n", "Let's pretend my HTML code said:\n", "\n", "`Hello <strong><em>Data School</em> students</strong>`\n", "\n", "The text **Data School students** would be bold, because all of that text is between the opening `<strong>` tag and the closing `</strong>` tag. The text ***Data School*** would also be in italics, because the `<em>` tag means \"use italics\". The text \"Hello\" would not be bold or italics, because it's not within either the `<strong>` or `<em>` tags. Thus, it would appear as follows:\n", "\n", "Hello ***Data School* students**\n", "\n", "The central point to take away from this example is that **tags \"mark up\" text from wherever they open to wherever they close,** regardless of whether they are nested within other tags.\n", "\n", "Got it? You now know enough about HTML to start web scraping!" 
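] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you'd like to check this nesting rule in code, here is a minimal sketch using Beautiful Soup, the parsing library this tutorial introduces below (`pip install beautifulsoup4`). The variable names `snippet` and `demo` are made up for this illustration and aren't used anywhere else in the tutorial. It prints the text inside each tag, so you can see that the `<strong>` tag covers everything between its opening and closing tags, including the text nested inside `<em>`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "# the pretend HTML from the nesting example above\n", "snippet = 'Hello <strong><em>Data School</em> students</strong>'\n", "demo = BeautifulSoup(snippet, 'html.parser')\n", "\n", "# text between <strong> and </strong>, including the nested <em> text\n", "print(demo.find('strong').text)\n", "\n", "# text between <em> and </em> only\n", "print(demo.find('em').text)" 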
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading the web page into Python\n", "\n", "The first thing we need to do is to read the HTML for this article into Python, which we'll do using the [requests](http://docs.python-requests.org/en/master/) library. (If you don't have it, you can `pip install requests` from the command line.)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import requests\n", "r = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code above fetches our web page from the URL, and stores the result in a \"response\" object called `r`. That response object has a `text` attribute, which contains the same HTML code we saw when viewing the source from our web browser:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<!DOCTYPE html>\n", "<!--[if (gt IE 9)|!(IE)]> <!--><html lang=\"en\" class=\"no-js page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive\" itemid=\"https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html\" itemtype=\"http://schema.org/NewsArticle\" itemscope xmlns:og=\"http://opengraphprotocol.org/schema/\"><!--<![endif]-->\n", "<!--[if IE 9]> <html lang=\"en\" class=\"no-js ie9 lt-ie10 page-interactive section-opinion page\n" ] } ], "source": [ "# print the first 500 characters of the HTML\n", "print(r.text[0:500])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Parsing the HTML using 
Beautiful Soup\n", "\n", "We're going to parse the HTML using the [Beautiful Soup 4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library, which is a popular Python library for web scraping. (If you don't have it, you can `pip install beautifulsoup4` from the command line.)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "soup = BeautifulSoup(r.text, 'html.parser')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code above parses the HTML (stored in `r.text`) into a special object called `soup` that the Beautiful Soup library understands. In other words, Beautiful Soup is **reading the HTML and making sense of its structure.**\n", "\n", "(Note that `html.parser` is the parser included with the Python standard library, though other parsers can be used by Beautiful Soup. See [differences between parsers](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers) to learn more.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Collecting all of the records\n", "\n", "The Python code above is the standard code I use with every web scraping project. Now, we're going to start **taking advantage of the patterns we noticed in the article formatting** to build our dataset!\n", "\n", "Let's take another look at the article, and compare it with the HTML:\n", "\n", "\n", "\n", "\n", "\n", "You might have noticed that each record has the following format:\n", "\n", "`<span class=\"short-desc\"><strong> DATE </strong> LIE <span class=\"short-truth\"><a href=\"URL\"> EXPLANATION </a></span></span>`\n", "\n", "There's an outer `<span>` tag, and then nested within it is a `<strong>` tag plus another `<span>` tag, which itself contains an `<a>` tag. All of these tags affect the formatting of the text. 
And because the New York Times wants each record to appear in a consistent way in your web browser, we know that **each record will be tagged in a consistent way in the HTML.** This is the pattern that allows us to build our dataset!\n", "\n", "Let's ask Beautiful Soup to **find all of the records:**" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "results = soup.find_all('span', attrs={'class':'short-desc'})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This code searches the `soup` object for all `<span>` tags with the attribute `class=\"short-desc\"`. It returns a special Beautiful Soup object (called a \"ResultSet\") containing the search results.\n", "\n", "`results` acts like a **Python list**, so we can check its length:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "116" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are 116 results, which seems reasonable given the length of the article. (If this number did not seem reasonable, we would examine the HTML further to determine if our assumptions about the patterns in the HTML were incorrect.)\n", "\n", "We can also slice the object like a list, in order to examine the **first three results:**" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[<span class=\"short-desc\"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class=\"short-truth\"><a href=\"https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the\" target=\"_blank\">(He was for an invasion before he was against it.)</a></span></span>,\n", " <span class=\"short-desc\"><strong>Jan. 
21 </strong>“A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.” <span class=\"short-truth\"><a href=\"http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/\" target=\"_blank\">(Trump was on the cover 11 times and Nixon appeared 55 times.)</a></span></span>,\n", " <span class=\"short-desc\"><strong>Jan. 23 </strong>“Between 3 million and 5 million illegal votes caused me to lose the popular vote.” <span class=\"short-truth\"><a href=\"https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html\" target=\"_blank\">(There's no evidence of illegal voting.)</a></span></span>]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results[0:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll also check that the **last result** in this object matches the last record in the article:\n", "\n", "" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<span class=\"short-desc\"><strong>July 19 </strong>“But the F.B.I. 
person really reports directly to the president of the United States, which is interesting.” <span class=\"short-truth\"><a href=\"https://www.usatoday.com/story/news/politics/onpolitics/2017/07/20/fbi-director-reports-justice-department-not-president/495094001/\" target=\"_blank\">(He reports directly to the attorney general.)</a></span></span>" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results[-1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looks good!\n", "\n", "We have now collected all 116 of the records, but we still need to **separate each record into its four components** (date, lie, explanation, and URL) in order to give the dataset some structure." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extracting the date\n", "\n", "Web scraping is often an iterative process, in which you experiment with your code until it works exactly as you desire. To simplify the experimentation, we'll start by only working with the **first record** in the `results` object, and then later on we'll modify our code to use a loop:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<span class=\"short-desc\"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class=\"short-truth\"><a href=\"https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the\" target=\"_blank\">(He was for an invasion before he was against it.)</a></span></span>" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_result = results[0]\n", "first_result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although `first_result` may look like a Python string, you'll notice that there are no quote marks around it. 
Instead, it's another special Beautiful Soup object (called a \"Tag\") that has specific methods and attributes.\n", "\n", "In order to locate the date, we can use its `find()` method to **find a single tag** that matches a specific pattern, in contrast to the `find_all()` method we used above to **find all tags** that match a pattern:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<strong>Jan. 21 </strong>" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_result.find('strong')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This code searches `first_result` for the first instance of a `<strong>` tag, and again returns a Beautiful Soup \"Tag\" object (not a string).\n", "\n", "Since we want to **extract the text between the opening and closing tags**, we can access its `text` attribute, which does in fact return a regular Python string:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Jan. 21\\xa0'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_result.find('strong').text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is `\\xa0`? You don't actually need to know this, but it's an \"escape sequence\" that represents the non-breaking space character, which is written as `&nbsp;` in the HTML source we saw earlier.\n", "\n", "However, you do need to know that **an escape sequence represents a single character** within a string. Let's slice it off from the end of the string:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Jan. 
21'" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_result.find('strong').text[0:-1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we're going to add the year, since we don't want our dataset to include ambiguous dates:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Jan. 21, 2017'" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_result.find('strong').text[0:-1] + ', 2017'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extracting the lie\n", "\n", "Let's take another look at `first_result`:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<span class=\"short-desc\"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class=\"short-truth\"><a href=\"https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the\" target=\"_blank\">(He was for an invasion before he was against it.)</a></span></span>" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our goal is to extract the two sentences about Iraq. Unfortunately, there isn't a pair of opening and closing tags that starts **immediately before the lie** and ends **immediately after the lie**. Therefore, we're going to have to use a different technique:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[<strong>Jan. 21 </strong>,\n", " \"“I wasn't a fan of Iraq. 
I didn't want to go into Iraq.” \",\n", " <span class=\"short-truth\"><a href=\"https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the\" target=\"_blank\">(He was for an invasion before he was against it.)</a></span>]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_result.contents" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `first_result` \"Tag\" has a `contents` attribute, which returns a Python list containing its \"children\". What are children? They are the **Tags and strings that are nested within a Tag.**\n", "\n", "We can slice this list to extract the second element:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"“I wasn't a fan of Iraq. I didn't want to go into Iraq.” \"" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_result.contents[1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we'll slice off the curly quotation marks as well as the extra space at the end:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"I wasn't a fan of Iraq. 
I didn't want to go into Iraq.\"" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_result.contents[1][1:-2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extracting the explanation\n", "\n", "Based upon what you've seen already, you might have figured out that we have at least **two options** for how we extract the third component of the record, which is the writer's explanation of why the President's statement was a lie.\n", "\n", "The **first option** is to slice the `contents` attribute, like we did when extracting the lie:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<span class=\"short-truth\"><a href=\"https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the\" target=\"_blank\">(He was for an invasion before he was against it.)</a></span>" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_result.contents[2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The **second option** is to search for the surrounding tag, like we did when extracting the date:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<a href=\"https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the\" target=\"_blank\">(He was for an invasion before he was against it.)</a>" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_result.find('a')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Either way, we can access the `text` attribute and then slice off the opening and closing parentheses:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { 
"text/plain": [ "'He was for an invasion before he was against it.'" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_result.find('a').text[1:-1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extracting the URL\n", "\n", "Finally, we want to extract the URL of the article that substantiates the writer's claim that the President was lying.\n", "\n", "Let's examine the `<a>` tag within `first_result`:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<a href=\"https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the\" target=\"_blank\">(He was for an invasion before he was against it.)</a>" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_result.find('a')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So far in this tutorial, we have been extracting text that is **between tags**. In this case, the text we want to extract is located **within the tag itself**. 
Specifically, we want to access the value of the `href` attribute within the `<a>` tag.\n", "\n", "Beautiful Soup treats tag attributes and their values like **key-value pairs in a dictionary:** you put the attribute name in brackets (like a dictionary key), and you get back the attribute's value:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_result.find('a')['href']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Recap: Beautiful Soup methods and attributes\n", "\n", "Before we finish building the dataset, I want to summarize a few ways you can interact with Beautiful Soup objects.\n", "\n", "You can apply these **two methods** to either the initial `soup` object or a Tag object (such as `first_result`):\n", "\n", "- `find()`: searches for the first matching tag, and returns a Tag object\n", "- `find_all()`: searches for all matching tags, and returns a ResultSet object (which you can treat like a list of Tags)\n", "\n", "You can extract information from a Tag object (such as `first_result`) using these **two attributes:**\n", "\n", "- `text`: extracts the text of a Tag, and returns a string\n", "- `contents`: extracts the children of a Tag, and returns a list of Tags and strings\n", "\n", "It's important to keep track of whether you are interacting with a Tag, ResultSet, list, or string, because that affects which methods and attributes you can access.\n", "\n", "And of course, there are many more methods and attributes available to you, which are described in the [Beautiful Soup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)." 
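] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make the recap concrete, here is a minimal sketch that applies `find_all()`, `find()`, `text`, `contents`, and attribute lookup to one made-up record. The HTML string, the `example.com` URL, and the `demo_` variable names are assumptions for illustration only (they are not taken from the New York Times article), so the `soup` and `results` objects used in the rest of the tutorial are left untouched:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "# a made-up record that mimics the structure of the article's HTML\n", "demo_html = ('<span class=\"short-desc\"><strong>Jan. 1&nbsp;</strong>“An example claim.” '\n", "             '<span class=\"short-truth\"><a href=\"https://example.com/source\">'\n", "             '(An example explanation.)</a></span></span>')\n", "demo_soup = BeautifulSoup(demo_html, 'html.parser')\n", "\n", "# find_all() returns a ResultSet, which we can index like a list of Tags\n", "demo_results = demo_soup.find_all('span', attrs={'class': 'short-desc'})\n", "demo_result = demo_results[0]\n", "\n", "# find() returns the first matching Tag, and text returns a string\n", "print(repr(demo_result.find('strong').text))\n", "\n", "# contents returns a list of the Tag's children (Tags and strings)\n", "print(demo_result.contents[1])\n", "\n", "# brackets look up an attribute's value, like a dictionary key\n", "print(demo_result.find('a')['href'])" 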
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Building the dataset\n", "\n", "Now that we've figured out how to extract the four components of `first_result`, we can **create a loop to repeat this process** on all 116 `results`. We'll store the output in a **list of tuples** called `records`:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": true }, "outputs": [], "source": [ "records = []\n", "for result in results:\n", "    date = result.find('strong').text[0:-1] + ', 2017'\n", "    lie = result.contents[1][1:-2]\n", "    explanation = result.find('a').text[1:-1]\n", "    url = result.find('a')['href']\n", "    records.append((date, lie, explanation, url))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since there were 116 `results`, we should have 116 `records`:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "116" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(records)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's do a quick spot check of the first three records:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('Jan. 21, 2017',\n", " \"I wasn't a fan of Iraq. I didn't want to go into Iraq.\",\n", " 'He was for an invasion before he was against it.',\n", " 'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'),\n", " ('Jan. 21, 2017',\n", " 'A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.',\n", " 'Trump was on the cover 11 times and Nixon appeared 55 times.',\n", " 'http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/'),\n", " ('Jan. 
23, 2017',\n", " 'Between 3 million and 5 million illegal votes caused me to lose the popular vote.',\n", " \"There's no evidence of illegal voting.\",\n", " 'https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html')]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "records[0:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looks good!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Applying a tabular data structure\n", "\n", "The last major step in this process is to apply a tabular data structure to our existing structure (which is a list of tuples). We're going to do this using the [pandas](http://pandas.pydata.org/) library, an incredibly popular Python library for data analysis and manipulation. (If you don't have it, here are the [installation instructions](http://pandas.pydata.org/pandas-docs/stable/install.html).)\n", "\n", "The primary data structure in pandas is the \"DataFrame\", which is suitable for tabular data with columns of different types, **similar to an Excel spreadsheet or SQL table.** We can convert our list of tuples into a DataFrame by passing it to the DataFrame constructor and specifying the desired column names:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "df = pd.DataFrame(records, columns=['date', 'lie', 'explanation', 'url'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The DataFrame includes a `head()` method, which allows you to examine the top of the DataFrame:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style>\n", " .dataframe thead tr:only-child th {\n", " text-align: right;\n", " 
}\n", "\n", " .dataframe thead th {\n", " text-align: left;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>date</th>\n", " <th>lie</th>\n", " <th>explanation</th>\n", " <th>url</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>Jan. 21, 2017</td>\n", " <td>I wasn't a fan of Iraq. I didn't want to go in...</td>\n", " <td>He was for an invasion before he was against it.</td>\n", " <td>https://www.buzzfeed.com/andrewkaczynski/in-20...</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>Jan. 21, 2017</td>\n", " <td>A reporter for Time magazine — and I have been...</td>\n", " <td>Trump was on the cover 11 times and Nixon appe...</td>\n", " <td>http://nation.time.com/2013/11/06/10-things-yo...</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>Jan. 23, 2017</td>\n", " <td>Between 3 million and 5 million illegal votes ...</td>\n", " <td>There's no evidence of illegal voting.</td>\n", " <td>https://www.nytimes.com/2017/01/23/us/politics...</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>Jan. 25, 2017</td>\n", " <td>Now, the audience was the biggest ever. But th...</td>\n", " <td>Official aerial photos show Obama's 2009 inaug...</td>\n", " <td>https://www.nytimes.com/2017/01/21/us/politics...</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>Jan. 
25, 2017</td>\n", " <td>Take a look at the Pew reports (which show vot...</td>\n", " <td>The report never mentioned voter fraud.</td>\n", " <td>https://www.nytimes.com/2017/01/24/us/politics...</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " date lie \\\n", "0 Jan. 21, 2017 I wasn't a fan of Iraq. I didn't want to go in... \n", "1 Jan. 21, 2017 A reporter for Time magazine — and I have been... \n", "2 Jan. 23, 2017 Between 3 million and 5 million illegal votes ... \n", "3 Jan. 25, 2017 Now, the audience was the biggest ever. But th... \n", "4 Jan. 25, 2017 Take a look at the Pew reports (which show vot... \n", "\n", " explanation \\\n", "0 He was for an invasion before he was against it. \n", "1 Trump was on the cover 11 times and Nixon appe... \n", "2 There's no evidence of illegal voting. \n", "3 Official aerial photos show Obama's 2009 inaug... \n", "4 The report never mentioned voter fraud. \n", "\n", " url \n", "0 https://www.buzzfeed.com/andrewkaczynski/in-20... \n", "1 http://nation.time.com/2013/11/06/10-things-yo... \n", "2 https://www.nytimes.com/2017/01/23/us/politics... \n", "3 https://www.nytimes.com/2017/01/21/us/politics... \n", "4 https://www.nytimes.com/2017/01/24/us/politics... " ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The numbers on the left side of the DataFrame are known as the \"index\", which act as identifiers for the rows. 
Because we didn't specify an index, it was automatically assigned as the integers 0 to 115.\n", "\n", "We can examine the bottom of the DataFrame using the `tail()` method:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style>\n", " .dataframe thead tr:only-child th {\n", " text-align: right;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: left;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>date</th>\n", " <th>lie</th>\n", " <th>explanation</th>\n", " <th>url</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>111</th>\n", " <td>July 6, 2017</td>\n", " <td>As a result of this insistence, billions of do...</td>\n", " <td>NATO countries agreed to meet defense spending...</td>\n", " <td>http://www.nbcnews.com/politics/donald-trump/f...</td>\n", " </tr>\n", " <tr>\n", " <th>112</th>\n", " <td>July 17, 2017</td>\n", " <td>We’ve signed more bills — and I’m talking abou...</td>\n", " <td>Clinton, Carter, Truman, and F.D.R. 
had signed...</td>\n", " <td>https://www.nytimes.com/2017/07/17/us/politics...</td>\n", " </tr>\n", " <tr>\n", " <th>113</th>\n", " <td>July 19, 2017</td>\n", " <td>Um, the Russian investigation — it’s not an in...</td>\n", " <td>It is.</td>\n", " <td>http://time.com/4823514/donald-trump-investiga...</td>\n", " </tr>\n", " <tr>\n", " <th>114</th>\n", " <td>July 19, 2017</td>\n", " <td>I heard that Harry Truman was first, and then ...</td>\n", " <td>Presidents Clinton, Carter, Truman, and F.D.R....</td>\n", " <td>https://www.nytimes.com/2017/07/17/us/politics...</td>\n", " </tr>\n", " <tr>\n", " <th>115</th>\n", " <td>July 19, 2017</td>\n", " <td>But the F.B.I. person really reports directly ...</td>\n", " <td>He reports directly to the attorney general.</td>\n", " <td>https://www.usatoday.com/story/news/politics/o...</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " date lie \\\n", "111 July 6, 2017 As a result of this insistence, billions of do... \n", "112 July 17, 2017 We’ve signed more bills — and I’m talking abou... \n", "113 July 19, 2017 Um, the Russian investigation — it’s not an in... \n", "114 July 19, 2017 I heard that Harry Truman was first, and then ... \n", "115 July 19, 2017 But the F.B.I. person really reports directly ... \n", "\n", " explanation \\\n", "111 NATO countries agreed to meet defense spending... \n", "112 Clinton, Carter, Truman, and F.D.R. had signed... \n", "113 It is. \n", "114 Presidents Clinton, Carter, Truman, and F.D.R.... \n", "115 He reports directly to the attorney general. \n", "\n", " url \n", "111 http://www.nbcnews.com/politics/donald-trump/f... 
\n", "112 https://www.nytimes.com/2017/07/17/us/politics... \n", "113 http://time.com/4823514/donald-trump-investiga... \n", "114 https://www.nytimes.com/2017/07/17/us/politics... \n", "115 https://www.usatoday.com/story/news/politics/o... " ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Did you notice that \"January\" is abbreviated, while \"July\" is not? It's best to format your data consistently, and so we're going to convert the date column to pandas' special \"datetime\" format:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df['date'] = pd.to_datetime(df['date'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code above converts the \"date\" column to datetime format, and then overwrites the existing \"date\" column. 
(Notice that we did not have to tell pandas that the column was originally in \"MONTH DAY, YEAR\" format - **pandas just figured it out!**)\n", "\n", "Let's take a look at the results:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style>\n", " .dataframe thead tr:only-child th {\n", " text-align: right;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: left;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>date</th>\n", " <th>lie</th>\n", " <th>explanation</th>\n", " <th>url</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>2017-01-21</td>\n", " <td>I wasn't a fan of Iraq. I didn't want to go in...</td>\n", " <td>He was for an invasion before he was against it.</td>\n", " <td>https://www.buzzfeed.com/andrewkaczynski/in-20...</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2017-01-21</td>\n", " <td>A reporter for Time magazine — and I have been...</td>\n", " <td>Trump was on the cover 11 times and Nixon appe...</td>\n", " <td>http://nation.time.com/2013/11/06/10-things-yo...</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2017-01-23</td>\n", " <td>Between 3 million and 5 million illegal votes ...</td>\n", " <td>There's no evidence of illegal voting.</td>\n", " <td>https://www.nytimes.com/2017/01/23/us/politics...</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>2017-01-25</td>\n", " <td>Now, the audience was the biggest ever. 
But th...</td>\n", " <td>Official aerial photos show Obama's 2009 inaug...</td>\n", " <td>https://www.nytimes.com/2017/01/21/us/politics...</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>2017-01-25</td>\n", " <td>Take a look at the Pew reports (which show vot...</td>\n", " <td>The report never mentioned voter fraud.</td>\n", " <td>https://www.nytimes.com/2017/01/24/us/politics...</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " date lie \\\n", "0 2017-01-21 I wasn't a fan of Iraq. I didn't want to go in... \n", "1 2017-01-21 A reporter for Time magazine — and I have been... \n", "2 2017-01-23 Between 3 million and 5 million illegal votes ... \n", "3 2017-01-25 Now, the audience was the biggest ever. But th... \n", "4 2017-01-25 Take a look at the Pew reports (which show vot... \n", "\n", " explanation \\\n", "0 He was for an invasion before he was against it. \n", "1 Trump was on the cover 11 times and Nixon appe... \n", "2 There's no evidence of illegal voting. \n", "3 Official aerial photos show Obama's 2009 inaug... \n", "4 The report never mentioned voter fraud. \n", "\n", " url \n", "0 https://www.buzzfeed.com/andrewkaczynski/in-20... \n", "1 http://nation.time.com/2013/11/06/10-things-yo... \n", "2 https://www.nytimes.com/2017/01/23/us/politics... \n", "3 https://www.nytimes.com/2017/01/21/us/politics... \n", "4 https://www.nytimes.com/2017/01/24/us/politics... 
" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style>\n", " .dataframe thead tr:only-child th {\n", " text-align: right;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: left;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>date</th>\n", " <th>lie</th>\n", " <th>explanation</th>\n", " <th>url</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>111</th>\n", " <td>2017-07-06</td>\n", " <td>As a result of this insistence, billions of do...</td>\n", " <td>NATO countries agreed to meet defense spending...</td>\n", " <td>http://www.nbcnews.com/politics/donald-trump/f...</td>\n", " </tr>\n", " <tr>\n", " <th>112</th>\n", " <td>2017-07-17</td>\n", " <td>We’ve signed more bills — and I’m talking abou...</td>\n", " <td>Clinton, Carter, Truman, and F.D.R. 
had signed...</td>\n", " <td>https://www.nytimes.com/2017/07/17/us/politics...</td>\n", " </tr>\n", " <tr>\n", " <th>113</th>\n", " <td>2017-07-19</td>\n", " <td>Um, the Russian investigation — it’s not an in...</td>\n", " <td>It is.</td>\n", " <td>http://time.com/4823514/donald-trump-investiga...</td>\n", " </tr>\n", " <tr>\n", " <th>114</th>\n", " <td>2017-07-19</td>\n", " <td>I heard that Harry Truman was first, and then ...</td>\n", " <td>Presidents Clinton, Carter, Truman, and F.D.R....</td>\n", " <td>https://www.nytimes.com/2017/07/17/us/politics...</td>\n", " </tr>\n", " <tr>\n", " <th>115</th>\n", " <td>2017-07-19</td>\n", " <td>But the F.B.I. person really reports directly ...</td>\n", " <td>He reports directly to the attorney general.</td>\n", " <td>https://www.usatoday.com/story/news/politics/o...</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " date lie \\\n", "111 2017-07-06 As a result of this insistence, billions of do... \n", "112 2017-07-17 We’ve signed more bills — and I’m talking abou... \n", "113 2017-07-19 Um, the Russian investigation — it’s not an in... \n", "114 2017-07-19 I heard that Harry Truman was first, and then ... \n", "115 2017-07-19 But the F.B.I. person really reports directly ... \n", "\n", " explanation \\\n", "111 NATO countries agreed to meet defense spending... \n", "112 Clinton, Carter, Truman, and F.D.R. had signed... \n", "113 It is. \n", "114 Presidents Clinton, Carter, Truman, and F.D.R.... \n", "115 He reports directly to the attorney general. \n", "\n", " url \n", "111 http://www.nbcnews.com/politics/donald-trump/f... 
\n", "112 https://www.nytimes.com/2017/07/17/us/politics... \n", "113 http://time.com/4823514/donald-trump-investiga... \n", "114 https://www.nytimes.com/2017/07/17/us/politics... \n", "115 https://www.usatoday.com/story/news/politics/o... " ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not only is the date column now consistently formatted, but pandas also provides a wealth of [date-related functionality](https://pandas.pydata.org/pandas-docs/stable/timeseries.html) because it's in datetime format." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exporting the dataset to a CSV file\n", "\n", "Finally, we'll use pandas to export the DataFrame to a CSV (comma-separated values) file, which is the simplest and most common way to **store tabular data in a text file:**" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df.to_csv('trump_lies.csv', index=False, encoding='utf-8')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We set the `index` parameter to `False` to tell pandas that we don't need it to include the index (the integers 0 to 115) in the CSV file. 
You should be able to find this file in your working directory, and open it in any text editor or spreadsheet program!\n", "\n", "In the future, you can rebuild this DataFrame by reading the CSV file back into pandas:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df = pd.read_csv('trump_lies.csv', parse_dates=['date'], encoding='utf-8')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to learn a lot more about the pandas library, you can watch my video series, [Easier data analysis in Python with pandas](http://www.dataschool.io/easier-data-analysis-with-pandas/), or check out my [top 8 resources for learning pandas](http://www.dataschool.io/best-python-pandas-resources/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary: 16 lines of Python code\n", "\n", "Here are the 16 lines of code that we used to scrape the web page, extract the relevant data, convert it into a tabular dataset, and export it to a CSV file:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import requests\n", "r = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')\n", "\n", "from bs4 import BeautifulSoup\n", "soup = BeautifulSoup(r.text, 'html.parser')\n", "results = soup.find_all('span', attrs={'class':'short-desc'})\n", "\n", "records = []\n", "for result in results:\n", " date = result.find('strong').text[0:-1] + ', 2017'\n", " lie = result.contents[1][1:-2]\n", " explanation = result.find('a').text[1:-1]\n", " url = result.find('a')['href']\n", " records.append((date, lie, explanation, url))\n", "\n", "import pandas as pd\n", "df = pd.DataFrame(records, columns=['date', 'lie', 'explanation', 'url'])\n", "df['date'] 
= pd.to_datetime(df['date'])\n", "df.to_csv('trump_lies.csv', index=False, encoding='utf-8')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Appendix A: Web scraping advice\n", "\n", "- Web scraping works best with **static, well-structured web pages**. Dynamic or interactive content on a web page is often not accessible through the HTML source, which makes scraping it much harder!\n", "- Web scraping is a \"fragile\" approach for building a dataset. The HTML on a page you are scraping can **change at any time**, which may cause your scraper to stop working.\n", "- If you can **download the data** you need from a website, or if the website provides an **API with data access**, those approaches are preferable to scraping since they are easier to implement and less likely to break.\n", "- If you are **scraping a lot of pages** from the same website (in rapid succession), it's best to insert delays in your code so that you don't overwhelm the website with requests. If the website decides you are causing a problem, it can block your IP address (which may affect everyone in your building!).\n", "- Before scraping a website, you should review its **robots.txt file** (which follows the [Robots exclusion standard](https://en.wikipedia.org/wiki/Robots_exclusion_standard)) to check whether you are \"allowed\" to scrape that website. 
(Here is the [robots.txt file for nytimes.com](http://www.nytimes.com/robots.txt).)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Appendix B: Web scraping resources\n", "\n", "- The [Beautiful Soup documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) is written like a tutorial, and is worth reading to gain a detailed understanding of the library.\n", "- For more Beautiful Soup examples, see [Web Scraping 101 with Python](http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/), [More web scraping with Python](http://www.gregreda.com/2013/04/29/more-web-scraping-with-python/), and this [web scraping lesson](http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html) from Stanford's \"Text As Data\" course.\n", "- [Web Scraping with Python](https://www.youtube.com/watch?v=p1iX0uxM1w8) is a 3-hour video tutorial covering Beautiful Soup and other scraping tools. 
(The [slides](https://docs.google.com/presentation/d/1uHM_esB13VuSf7O1ScGueisnrtu-6usGFD3fs4z5YCE/edit#slide=id.p) and [code](https://github.com/kjam/python-web-scraping-tutorial) are also available.)\n", "- [Scrapy](http://scrapy.org/) is a popular application framework that is useful for more complex web scraping projects.\n", "- [How a Math Genius Hacked OkCupid to Find True Love](http://www.wired.com/2014/01/how-to-hack-okcupid/all/) and [How Netflix Reverse Engineered Hollywood](http://www.theatlantic.com/technology/archive/2014/01/how-netflix-reverse-engineered-hollywood/282679/?single_page=true) are two fun examples of using web scraping to build an interesting dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Appendix C: Alternative syntax for Beautiful Soup\n", "\n", "It's worth noting that Beautiful Soup actually offers multiple ways to express the same command. I tend to use the most verbose option, since I think it makes the code readable, but it's useful to be able to recognize the alternative syntax since you might see it used elsewhere.\n", "\n", "For example, you can **search for a tag** by accessing it like an attribute:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<strong>Jan. 
21 </strong>" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# search for a tag by name\n", "first_result.find('strong')\n", "\n", "# shorter alternative: access it like an attribute\n", "first_result.strong" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also **search for multiple tags** a few different ways:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# search for multiple tags by name and attribute\n", "results = soup.find_all('span', attrs={'class':'short-desc'})\n", "\n", "# shorter alternative: if you don't specify a method, it's assumed to be find_all()\n", "results = soup('span', attrs={'class':'short-desc'})\n", "\n", "# even shorter alternative: you can specify the attribute as if it's a parameter\n", "results = soup('span', class_='short-desc')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For more details, check out the [Beautiful Soup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 1 }