Lesson 2

Advanced Link Navigation and URL Management in Web Scraping

Topic Overview

Welcome! In this lesson, we'll delve into advanced link navigation and URL management within the realm of web scraping using Python and BeautifulSoup. Our goal is to ensure that you can navigate between linked web pages and manage URLs effectively for scalable web scraping.

Navigating Author Details

To solidify your understanding of link navigation, we'll focus on a scenario where you scrape quotes from a website and navigate to author pages to extract additional information. This process involves extracting links from the main page, navigating to the linked pages, and scraping data from those pages. The following code scrapes quotes from the main page and navigates to the author pages for more information:

Python
import requests
from bs4 import BeautifulSoup

def scrape_quotes(base_url):
    # Fetch and parse the main page
    response = requests.get(base_url)
    soup = BeautifulSoup(response.text, 'html.parser')

    quotes = soup.select('.quote')

    for quote in quotes:
        text = quote.select_one('.text').get_text()
        author = quote.select_one('.author').get_text()
        print(f'{text} - {author}')

        # Extract the relative link to the author's page and build the full URL
        endpoint_to_about_page = quote.select_one('span a')['href']
        url_to_about_page = base_url + endpoint_to_about_page

        # Fetch and parse the author's page for extra details
        response = requests.get(url_to_about_page)
        soup_about = BeautifulSoup(response.text, 'html.parser')
        born_date = soup_about.select_one('.author-born-date').get_text()
        born_location = soup_about.select_one('.author-born-location').get_text()
        # born_location already starts with "in", e.g. "in Ulm, Germany"
        print(f'{author} was born on {born_date} {born_location}\n')

base_url = 'http://quotes.toscrape.com'
scrape_quotes(base_url)
  1. First, we import the necessary libraries and define a soup object for the main page.

  2. Then, we extract quotes from the main page and iterate over each quote to extract text and author information.

  3. After that, we extract the endpoint to the author's page and construct the full URL:

    Python
    endpoint_to_about_page = quote.select_one('span a')['href']
    url_to_about_page = base_url + endpoint_to_about_page

    Remember that select_one() returns the first matching element, and we read its ['href'] attribute to extract the endpoint. Because this endpoint is a relative URL, concatenating it with base_url gives us the full address; a more flexible way to join URLs is sketched right after this walkthrough.

  4. Once we have the full URL, we send a request to the author's page and create a new soup object to extract additional information:

    Python
    response = requests.get(url_to_about_page)
    soup_about = BeautifulSoup(response.text, 'html.parser')
    born_date = soup_about.select_one('.author-born-date').get_text()
    born_location = soup_about.select_one('.author-born-location').get_text()
    print(f'{author} was born on {born_date} {born_location}\n')

    Notice that in this snippet, too, we use select_one() to extract the author's birth date and location.
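
Joining base_url and the endpoint with + works here because quotes.toscrape.com uses root-relative links (such as /author/Albert-Einstein) and our base_url has no trailing slash. A more flexible option for your own projects, not required by the lesson's script, is urljoin from Python's standard library, which resolves relative and absolute hrefs much like a browser does:

Python
from urllib.parse import urljoin

base_url = 'http://quotes.toscrape.com'

# Root-relative endpoints resolve against the site root
print(urljoin(base_url, '/author/Albert-Einstein'))
# -> http://quotes.toscrape.com/author/Albert-Einstein

# Absolute hrefs are returned unchanged
print(urljoin(base_url, 'http://example.com/other'))
# -> http://example.com/other

# In the scraper, it is a drop-in replacement for string concatenation:
# url_to_about_page = urljoin(base_url, endpoint_to_about_page)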

Running the full script above produces the following output:

Plain text
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” - Albert Einstein
Albert Einstein was born on March 14, 1879 in Ulm, Germany

“It is our choices, Harry, that show what we truly are, far more than our abilities.” - J.K. Rowling
J.K. Rowling was born on July 31, 1965 in Yate, South Gloucestershire, England, The United Kingdom

“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.” - Albert Einstein
Albert Einstein was born on March 14, 1879 in Ulm, Germany

“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.” - Jane Austen
Jane Austen was born on December 16, 1775 in Steventon Rectory, Hampshire, The United Kingdom
...
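
Notice in the output that the same author page (Albert Einstein's, for example) is downloaded once for every quote by that author. When you manage URLs at a larger scale, a common refinement is to keep track of the pages you have already visited so each one is requested only once. The following is a minimal sketch of that idea; the function name scrape_quotes_cached and the author_cache dictionary are illustrative choices, not part of the lesson's script:

Python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def scrape_quotes_cached(base_url):
    # Cache mapping an author-page URL to its (born_date, born_location)
    author_cache = {}

    response = requests.get(base_url)
    soup = BeautifulSoup(response.text, 'html.parser')

    for quote in soup.select('.quote'):
        text = quote.select_one('.text').get_text()
        author = quote.select_one('.author').get_text()
        print(f'{text} - {author}')

        url_to_about_page = urljoin(base_url, quote.select_one('span a')['href'])

        if url_to_about_page not in author_cache:
            # Only fetch author pages we have not seen before
            about_soup = BeautifulSoup(requests.get(url_to_about_page).text, 'html.parser')
            author_cache[url_to_about_page] = (
                about_soup.select_one('.author-born-date').get_text(),
                about_soup.select_one('.author-born-location').get_text(),
            )

        born_date, born_location = author_cache[url_to_about_page]
        print(f'{author} was born on {born_date} {born_location}\n')

scrape_quotes_cached('http://quotes.toscrape.com')
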
Lesson Summary and Practice

In this lesson, we've covered advanced link navigation and URL management in web scraping using Python and BeautifulSoup. We examined and extracted links, navigated between pages, handled relative and absolute URLs, and applied these concepts in a detailed code example. These skills will enable you to handle more complex web scraping tasks effectively.

The practice exercises that follow will help you deepen your understanding of link navigation and URL management in web scraping, enhancing your proficiency in scalable scraping projects. Happy Scraping!

Enjoy this lesson? Now it's time to practice with Cosmo!

Practice is how you turn knowledge into actual skills.