Scraping HTML Lists with Beautiful Soup

Lesson 3

Introduction to HTML Lists

Welcome! In this lesson, we will dive into the world of web scraping, specifically focusing on scraping HTML lists. Let's start with a brief introduction to HTML lists and their significance in web scraping.

HTML Lists Overview

HTML lists are used to display a series of items in a structured manner. Broadly, there are two types of lists:

Ordered Lists (<ol>): These lists are numbered (e.g., 1, 2, 3).
Unordered Lists (<ul>): These lists are bulleted (e.g., •, •, •).

Each item in these lists is enclosed within <li> tags. Lists commonly appear on web pages as navigation menus, product listings, and similar repeated structures, which makes them ideal targets for web scraping.

Example of an ordered list:

HTML, XML
<ol>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
</ol>
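To see how this markup looks from Python, here is a minimal sketch that parses the ordered list above from an inline string, so no network request is needed. The variable names (html, items) are illustrative only.

```python
from bs4 import BeautifulSoup

# The ordered list from the example above, stored as an inline string.
html = """
<ol>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
</ol>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all("li") returns every <li> tag in document order.
items = [li.get_text(strip=True) for li in soup.find_all("li")]
print(items)  # ['Item 1', 'Item 2', 'Item 3']
```

The same call works for a <ul>, since both list types wrap their items in <li> tags.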

Loading the Libraries and Fetching the Webpage

We start by importing the required libraries and fetching the HTML content of the webpage.

Python
from bs4 import BeautifulSoup
import requests

# Download the page and parse it with Python's built-in HTML parser.
url = "https://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
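In practice, a request can fail or hang, so it may help to wrap the fetch step in a small helper. The sketch below is one possible arrangement, not part of the lesson's code: the function names fetch_html and make_soup and the 10-second timeout are assumptions chosen for illustration.

```python
from bs4 import BeautifulSoup
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text


def make_soup(html):
    """Parse an HTML string with the built-in parser (hypothetical helper)."""
    return BeautifulSoup(html, "html.parser")


# Demonstrated on an inline string so the example runs without network access.
demo = make_soup("<html><head><title>Demo</title></head></html>")
print(demo.title.get_text())  # Demo
```

For the live page, the two helpers would be combined as make_soup(fetch_html("https://books.toscrape.com/")).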

Next, we use a CSS selector to pick out the list that contains the books. The call soup.select(".page_inner section ol li") returns every <li> element that is a descendant of an <ol> inside a <section> inside an element with the class page_inner. We then loop through the selected items and extract the book titles:

Python
books_ordered_list = soup.select(".page_inner section ol li")

for book in books_ordered_list:
    title = book.select("article h3 a")[0]["title"]
    print(title)

book.select("article h3 a")[0]: Selects the <a> tag inside the <h3> of the <article> tag. Note that select returns a list, so we use [0] to access the first element.
book.select("article h3 a")[0]["title"]: Extracts the title attribute of the <a> tag.
print(title): Prints the extracted book title.
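Indexing with [0] raises an IndexError when a list item has no matching <a> tag. As a hedged alternative, select_one returns the first match or None, which makes the missing case easy to check. The sketch below uses made-up inline markup that only mimics the nesting targeted by the selectors above; the titles in it are placeholders, not data from the site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the nesting targeted by the selectors above.
html = """
<div class="page_inner">
  <section>
    <ol>
      <li><article><h3><a href="#" title="First Book">First ...</a></h3></article></li>
      <li><article><h3><a href="#" title="Second Book">Second ...</a></h3></article></li>
      <li><article><p>No heading here</p></article></li>
    </ol>
  </section>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

titles = []
for book in soup.select(".page_inner section ol li"):
    link = book.select_one("article h3 a")  # first match, or None if absent
    if link is not None and link.has_attr("title"):
        titles.append(link["title"])

print(titles)  # ['First Book', 'Second Book']
```

The third list item is skipped silently here; whether to skip, log, or raise in that case depends on how strict the scraper needs to be.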

The output will display the titles of the books listed on the webpage as follows:

Plain text
A Light in the Attic
Tipping the Velvet
Soumission
Sharp Objects
Sapiens: A Brief History of Humankind
The Requiem Red
The Dirty Little Secrets of Getting Your Dream Job
...

Summary

In this lesson on HTML lists, we explored the basics of HTML lists and their significance in web scraping. We also learned how to fetch a webpage, identify specific lists using CSS selectors, and extract information from the selected list items. This knowledge will be invaluable as we proceed with more advanced web scraping techniques.

Now, let's put this knowledge into practice with some hands-on exercises!
