Welcome! In this lesson, we will dive into the world of web scraping, specifically focusing on scraping HTML lists. Let's start with a brief introduction to HTML lists and their significance in web scraping.
HTML lists are used to display a series of items in a structured manner. Broadly, there are two types of lists:
- Ordered Lists (`<ol>`): These lists are numbered (e.g., 1, 2, 3).
- Unordered Lists (`<ul>`): These lists are bulleted (e.g., •, •, •).
Each item in these lists is enclosed within `<li>` tags. Lists are commonly found on web pages in forms like navigation menus, product listings, etc., making them ideal targets for web scraping.
Example of an ordered list:
```html
<ol>
  <li>Item 1</li>
  <li>Item 2</li>
  <li>Item 3</li>
</ol>
```
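As a quick preview of how such markup looks from the scraper's side, here is a minimal sketch that parses a small hand-written snippet with BeautifulSoup and reads the text of each list item. The HTML string and the variable names `ordered_items` and `unordered_items` are made up for illustration; they are not taken from any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, written by hand for this illustration.
html = """
<ol>
  <li>Item 1</li>
  <li>Item 2</li>
  <li>Item 3</li>
</ol>
<ul>
  <li>Apple</li>
  <li>Banana</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# "ol li" matches <li> elements nested inside an ordered list only.
ordered_items = [li.get_text(strip=True) for li in soup.select("ol li")]

# "ul li" matches <li> elements nested inside an unordered list only.
unordered_items = [li.get_text(strip=True) for li in soup.select("ul li")]

print(ordered_items)
print(unordered_items)
```

Because the selector names the parent list type, items from the `<ol>` and the `<ul>` end up in separate Python lists.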
We start by importing the required libraries and fetching the HTML content of the webpage.
```python
from bs4 import BeautifulSoup
import requests

url = "https://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
```
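In practice a request can fail or hang, so it may be worth guarding the fetch step. The sketch below wraps it in a helper; the name `fetch_soup` and the 10-second default timeout are assumptions made for illustration, not part of the lesson's code.

```python
from bs4 import BeautifulSoup
import requests


def fetch_soup(url, timeout=10):
    """Fetch a page and return it as a parsed BeautifulSoup object.

    Hypothetical helper: the name and the default timeout are illustrative choices.
    """
    # Give up if the server does not respond within `timeout` seconds.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
```

Calling `fetch_soup("https://books.toscrape.com/")` would return the same kind of `soup` object used in the rest of this lesson.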
Next, we use a CSS selector to identify the specific list containing the books:
- `soup.select(".page_inner section ol li")`: Selects all `<li>` elements that are descendants of `.page_inner section ol`.
With that, we loop through the selected items and extract the book titles:
```python
books_ordered_list = soup.select(".page_inner section ol li")

for book in books_ordered_list:
    title = book.select("article h3 a")[0]["title"]
    print(title)
```
- `book.select("article h3 a")[0]`: Selects the `<a>` tag inside the `<h3>` of the `<article>` tag. Note that `select` returns a list, so we use `[0]` to access the first element.
- `book.select("article h3 a")[0]["title"]`: Extracts the `title` attribute of the `<a>` tag.
- `print(title)`: Prints the extracted book title.
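Indexing with `[0]` raises an `IndexError` if a list item has no matching link, and `["title"]` raises a `KeyError` if the attribute is missing. One possible way to make the loop more tolerant is sketched below, using `select_one` and `.get`. The HTML string is a simplified, hand-written stand-in for the page structure (an assumption for illustration, not the real page source), and `titles` is a hypothetical variable name.

```python
from bs4 import BeautifulSoup

# Hand-written, simplified stand-in for the page structure (illustration only).
html = """
<div class="page_inner">
  <section>
    <ol>
      <li><article><h3><a href="#" title="A Light in the Attic">Book link</a></h3></article></li>
      <li><article><h3><a href="#">Link without a title attribute</a></h3></article></li>
      <li><article><p>No heading or link here</p></article></li>
    </ol>
  </section>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

titles = []
for book in soup.select(".page_inner section ol li"):
    # select_one returns the first match, or None when nothing matches.
    link = book.select_one("article h3 a")
    if link is None:
        continue
    # .get returns None instead of raising KeyError when the attribute is absent.
    title = link.get("title")
    if title:
        titles.append(title)

print(titles)
```

Only the first list item has both a link and a `title` attribute, so the other two are skipped rather than stopping the loop with an exception.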
The output will display the titles of the books listed on the webpage as follows:
```text
A Light in the Attic
Tipping the Velvet
Soumission
Sharp Objects
Sapiens: A Brief History of Humankind
The Requiem Red
The Dirty Little Secrets of Getting Your Dream Job
...
```
In this lesson on HTML lists, we explored the basics of HTML lists and their significance in web scraping. We also learned how to fetch a webpage, identify specific lists using CSS selectors, and extract information from the selected list items. This knowledge will be invaluable as we proceed with more advanced web scraping techniques.
Now, let's put this knowledge into practice with some hands-on exercises!