In this lesson, we will delve into the specifics of scraping data from HTML tables using Python and the Beautiful Soup library. By the end of this lesson, you will be able to effectively extract structured data from HTML tables and handle related challenges. The goals of this lesson are:
- Understand the HTML table structure.
- Learn to extract table data using BeautifulSoup.
- Handle row data effectively.
- Print and format the extracted data.
Let's get started!
HTML tables are a widely used element in web development for displaying structured data. The basic structure of an HTML table is composed of the following tags:
- `<table>`: Defines a table.
- `<tr>` (Table Row): Defines a row in a table.
- `<th>` (Table Header): Defines a header cell in a table.
- `<td>` (Table Data): Defines a standard cell in a table.
Here is an example of a simple HTML table:
```html
<table>
  <tr>
    <th>Author</th>
    <th>Quote</th>
  </tr>
  <tr>
    <td>Albert Einstein</td>
    <td>"Life is like riding a bicycle. To keep your balance, you must keep moving."</td>
  </tr>
  <tr>
    <td>Isaac Newton</td>
    <td>"If I have seen further it is by standing on the shoulders of Giants."</td>
  </tr>
</table>
```
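Before we scrape a live page, it can help to see how this structure maps onto BeautifulSoup calls. Here is a minimal sketch that parses a table supplied as a plain string; the `html` variable and the shortened quotes are purely illustrative:

```python
from bs4 import BeautifulSoup

# A small table as a plain string, just for illustration
html = """
<table>
  <tr><th>Author</th><th>Quote</th></tr>
  <tr><td>Albert Einstein</td><td>"Life is like riding a bicycle."</td></tr>
  <tr><td>Isaac Newton</td><td>"If I have seen further..."</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
for row in soup.find("table").find_all("tr"):
    # Collect the text of every header or data cell in this row
    cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    print(cells)
```

Each `<tr>` becomes one list of cell texts, which is the structure we will rely on when scraping a real page below.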
Now, let's start by fetching the webpage content and parsing it with BeautifulSoup.
Here’s how you can make an HTTP GET request and parse the HTML content:
```python
import requests
from bs4 import BeautifulSoup

url = 'http://quotes.toscrape.com/tableful'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
```
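The request can fail in practice (network errors, non-200 responses, and so on), so it often helps to check the response before parsing it. A minimal sketch of that defensive variant, reusing the same `url`:

```python
import requests
from bs4 import BeautifulSoup

url = 'http://quotes.toscrape.com/tableful'
response = requests.get(url, timeout=10)  # avoid hanging indefinitely
response.raise_for_status()               # raise an HTTPError for 4xx/5xx responses
soup = BeautifulSoup(response.text, 'html.parser')
```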
Once we have the HTML content, we can extract the table element using the `find` and `find_all` methods. Here is the code to extract the table and its rows:
```python
quotes = soup.find("table")
rows = quotes.find_all("tr")[1:-1]
```
Notice that we use the `find` method to get the table element and the `find_all` method to get all the rows in the table. The slice `[1:-1]` excludes the first and last rows, which are the header and footer, respectively.
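If you want to verify that the slice really dropped the header and footer, a quick, purely illustrative sanity check on `rows` might look like this:

```python
# Confirm what the slice left behind (illustrative only)
print("Number of data rows:", len(rows))
print("First row:", rows[0].get_text(strip=True))
print("Last row:", rows[-1].get_text(strip=True))
```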
Next, we’ll loop through the rows and extract the individual cell data. We also need to handle rows that contain nested elements, such as the anchor tags holding each quote's tags. Here is the code to handle this:
```python
# Each quote occupies two consecutive rows: the quote text, then its tags
for i in range(0, len(rows), 2):
    quote = rows[i]
    tags_row = quote.find_next_sibling()  # the row that holds the tags
    tags = tags_row.find_all("a") if tags_row else []

    print("Quote: ", quote.text)
    for tag in tags:
        print("Tag: ", tag.text)
```
In this code, notice that the information for one quote spans two rows of the table: the i-th row contains the quote, and the (i+1)-th row contains its tags. That is why we iterate over the rows with a step of 2. For each quote row, we use the `find_next_sibling` method to get the following row, which holds the tags as anchor (`<a>`) elements, and then extract and print the text of each anchor.
The output of the above code will be:
```
Quote: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” Author: Albert Einstein
Tag: change
Tag: deep-thoughts
Tag: thinking
Tag: world
Quote: “It is our choices, Harry, that show what we truly are, far more than our abilities.” Author: J.K. Rowling
Tag: abilities
Tag: choices
...
```
This output shows that the quotes and their tags were extracted and formatted correctly from the HTML table on the target website. By walking the rows in pairs, we pulled structured data out of nested HTML elements.
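If you would rather keep the scraped data for later processing than print it straight away, one option is to collect each quote and its tags into a dictionary. Here is a minimal sketch that reuses `rows` from above; the `scraped` name is just illustrative:

```python
scraped = []
for i in range(0, len(rows), 2):
    quote_row = rows[i]
    tags_row = quote_row.find_next_sibling()
    scraped.append({
        "quote": quote_row.get_text(strip=True),
        "tags": [a.get_text(strip=True) for a in tags_row.find_all("a")] if tags_row else [],
    })

print(scraped[0])  # the first quote with its tags
```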
In this lesson, you learned how to scrape and process data within HTML tables using Python and Beautiful Soup. We covered the structure of HTML tables, extracting table elements, handling row data, and printing the extracted data. By mastering these skills, you are now equipped to scrape structured data from web pages effectively.
It's time to put your skills to the test with a hands-on exercise. Let's get started!