In this lesson, we will delve into the specifics of scraping data from HTML tables using Python and the Beautiful Soup library. By the end of this lesson, you will be able to effectively extract structured data from HTML tables and handle related challenges. The goals of this lesson are:
- Understand the HTML table structure.
- Learn to extract table data using BeautifulSoup.
- Handle row data effectively.
- Print and format the extracted data.
Let's get started!
HTML tables are a widely used element in web development for displaying structured data. The basic structure of an HTML table is composed of the following tags:
- `<table>`: Defines a table.
- `<tr>` (Table Row): Defines a row in a table.
- `<th>` (Table Header): Defines a header cell in a table.
- `<td>` (Table Data): Defines a standard cell in a table.
Here is an example of a simple HTML table:
```html
<table>
  <tr>
    <th>Author</th>
    <th>Quote</th>
  </tr>
  <tr>
    <td>Albert Einstein</td>
    <td>"Life is like riding a bicycle. To keep your balance, you must keep moving."</td>
  </tr>
  <tr>
    <td>Isaac Newton</td>
    <td>"If I have seen further it is by standing on the shoulders of Giants."</td>
  </tr>
</table>
```
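Before we scrape a live page, it can help to see how this structure maps onto BeautifulSoup calls. Here is a minimal sketch that parses a table supplied as a plain string; the `html` variable and the shortened quotes are purely illustrative:

```python
from bs4 import BeautifulSoup

# A small table as a plain string, just for illustration
html = """
<table>
  <tr><th>Author</th><th>Quote</th></tr>
  <tr><td>Albert Einstein</td><td>"Life is like riding a bicycle."</td></tr>
  <tr><td>Isaac Newton</td><td>"If I have seen further..."</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
for row in soup.find("table").find_all("tr"):
    # Collect the text of every header or data cell in this row
    cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    print(cells)
```

Each `<tr>` becomes one list of cell texts, which is the structure we will rely on when scraping a real page below.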
Now, let's start by fetching the webpage content and parsing it with BeautifulSoup.
Here’s how you can make an HTTP GET request and parse the HTML content:
```python
import requests
from bs4 import BeautifulSoup

url = 'http://quotes.toscrape.com/tableful'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
```
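The request can fail in practice (network errors, non-200 responses, and so on), so it often helps to check the response before parsing it. A minimal sketch of that defensive variant, reusing the same `url`:

```python
import requests
from bs4 import BeautifulSoup

url = 'http://quotes.toscrape.com/tableful'
response = requests.get(url, timeout=10)  # avoid hanging indefinitely
response.raise_for_status()               # raise an HTTPError for 4xx/5xx responses
soup = BeautifulSoup(response.text, 'html.parser')
```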
Once we have the HTML content, we can extract the table element using the `find` and `find_all` methods. Here is the code to extract the table and its rows:
```python
quotes = soup.find("table")
rows = quotes.find_all("tr")[1:-1]
```
Notice that we use the `find` method to get the table element and the `find_all` method to get all the rows in the table. The slice `[1:-1]` excludes the first and last rows, which are the header and footer, respectively.
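If you want to verify that the slice really dropped the header and footer, a quick, purely illustrative sanity check on `rows` might look like this:

```python
# Confirm what the slice left behind (illustrative only)
print("Number of data rows:", len(rows))
print("First row:", rows[0].get_text(strip=True))
print("Last row:", rows[-1].get_text(strip=True))
```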
Next, we’ll loop through the rows and extract the individual cell data. We also need to handle rows that contain nested elements, such as the anchor tags holding each quote's tags. Here is the code to handle this:
```python
# Each quote occupies two consecutive rows: the quote text, then its tags
for i in range(0, len(rows), 2):
    quote = rows[i]
    tags_row = quote.find_next_sibling()  # the row that holds the tags
    tags = tags_row.find_all("a") if tags_row else []

    print("Quote: ", quote.text)
    for tag in tags:
        print("Tag: ", tag.text)
```
In this code, notice that the information for one quote spans two rows of the table: the i-th row contains the quote, and the (i+1)-th row contains its tags. That is why we iterate over the rows with a step of 2. For each quote row, we use the `find_next_sibling` method to get the following row, which holds the tags as anchor (`<a>`) elements, and then extract and print the text of each anchor.
The output of the above code will be:
```
Quote: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” Author: Albert Einstein
Tag: change
Tag: deep-thoughts
Tag: thinking
Tag: world
Quote: “It is our choices, Harry, that show what we truly are, far more than our abilities.” Author: J.K. Rowling
Tag: abilities
Tag: choices
...
```
This output shows that the quotes and their tags were extracted and formatted correctly from the HTML table on the target website. By walking the rows in pairs, we pulled structured data out of nested HTML elements.
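If you would rather keep the scraped data for later processing than print it straight away, one option is to collect each quote and its tags into a dictionary. Here is a minimal sketch that reuses `rows` from above; the `scraped` name is just illustrative:

```python
scraped = []
for i in range(0, len(rows), 2):
    quote_row = rows[i]
    tags_row = quote_row.find_next_sibling()
    scraped.append({
        "quote": quote_row.get_text(strip=True),
        "tags": [a.get_text(strip=True) for a in tags_row.find_all("a")] if tags_row else [],
    })

print(scraped[0])  # the first quote with its tags
```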
In this lesson, you learned how to scrape and process data within HTML tables using Python and Beautiful Soup. We covered the structure of HTML tables, extracting table elements, handling row data, and printing the extracted data. By mastering these skills, you are now equipped to scrape structured data from web pages effectively.
It's time to put your skills to the test with a hands-on exercise. Let's get started!