Extracting and Saving Images from Web Pages

Lesson 4

Overview

Welcome to the lesson on extracting and saving images from web pages. In this lesson, you will learn how to use Python and the BeautifulSoup library to scrape images from web pages and save them locally. By the end of this lesson, you will have a solid understanding of the entire process, from making web requests to locating image elements and saving the images.

Making Web Requests and Parsing HTML

We start by fetching the HTML content of the website we want to scrape. In this case, we'll use https://books.toscrape.com/.

First, import the necessary libraries, make an HTTP GET request to the website, and parse the returned HTML with BeautifulSoup.

Python
import requests
from bs4 import BeautifulSoup

url = 'https://books.toscrape.com/'
response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')

In this example, we fetch and parse the HTML content of the Books website.
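The snippet above assumes the request always succeeds. As a minimal sketch, the fetch-and-parse step could be wrapped in a helper that sets a timeout and raises on HTTP errors. The name fetch_soup and the 10-second default are illustrative assumptions, not part of the lesson's code:

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Fetch a page and return it parsed (hypothetical helper).

    Network problems and invalid URLs surface as
    requests.exceptions.RequestException subclasses.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.exceptions.HTTPError for 4xx/5xx status codes
    response.raise_for_status()
    return BeautifulSoup(response.text, 'html.parser')
```

Calling fetch_soup('https://books.toscrape.com/') inside a try/except for requests.exceptions.RequestException would let a script report a failed download instead of parsing an error page.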

Locating and Extracting Image URLs

With the parsed HTML content, use BeautifulSoup to locate image elements and extract their URLs from the src attribute:

Python
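# Find every <img> element on the page
images = soup.find_all('img')
# Collect each src attribute, skipping any <img> that has none
image_urls = [img['src'] for img in images if img.has_attr('src')]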

We now have a list of image URLs extracted from the web page.
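The extracted src values are often relative paths. The following sketch shows one way to turn them into absolute URLs with urljoin from the standard library. The HTML snippet, the function name extract_image_urls, and the page URL are made up for illustration and are not taken from the real site:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a fetched page
SAMPLE_HTML = """
<html><body>
  <img src="media/cover-one.jpg" alt="Cover one">
  <img src="/static/logo.png" alt="Logo">
  <img src="https://example.com/banner.png" alt="Banner">
  <img alt="No source attribute">
</body></html>
"""


def extract_image_urls(html, base_url):
    """Return absolute URLs for every <img> that has a src attribute."""
    soup = BeautifulSoup(html, 'html.parser')
    urls = []
    for img in soup.find_all('img'):
        src = img.get('src')  # None when the attribute is missing
        if not src:
            continue
        # urljoin resolves relative paths against the page URL
        # and leaves absolute URLs unchanged
        urls.append(urljoin(base_url, src))
    return urls


sample_urls = extract_image_urls(
    SAMPLE_HTML, 'https://books.toscrape.com/catalogue/page-2.html'
)
```

Because urljoin resolves against the page's own URL, a relative src found on a page inside a subdirectory ends up under that subdirectory, which simply prefixing the site root would get wrong.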

Downloading and Saving Images

Finally, we will download and save the extracted images to the local file system.

First, make sure the images directory exists. The makedirs function from the os module creates it, and exist_ok=True prevents an error if the directory is already there:

Python
import os
os.makedirs('images', exist_ok=True)

Next, we can iterate over the image URLs, send a request to each one, and save the images:

Python
for src in image_urls:
    full_src = f"https://books.toscrape.com/{src}" if not src.startswith('http') else src
    img_response = requests.get(full_src, stream=True)
    if img_response.status_code == 200:
        img_name = os.path.basename(src)  # Extract the image name from the URL
        with open(f"images/{img_name}", 'wb') as f:
            for chunk in img_response.iter_content(1024):
                f.write(chunk)
        print(f"Saved {img_name}")

After running the code, all the images will be saved in the images directory. Let's go through the code step by step:

We iterate over the image URLs extracted from the web page. Note that we construct the full URL by prepending the base URL if the image URL is relative.
For each URL, we send an HTTP GET request to download the image. Passing stream=True tells requests not to load the whole response body into memory at once.
If the request is successful (status code 200), we extract the image name from the URL and save the image to the images directory. Notice that we open the file in binary mode ('wb') and write the image content in chunks of 1024 bytes, which is more memory-efficient for large files.
Finally, we print a message indicating that the image was saved.
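The naming and writing steps described above can be sketched as two small helpers. Both filename_from_url and save_chunks are hypothetical names introduced here for illustration. The first assumes that a query string or fragment should not end up in the file name, and the second accepts any iterable of byte chunks, such as the one returned by iter_content:

```python
import os
from urllib.parse import unquote, urlparse


def filename_from_url(url, default='image'):
    """Derive a file name from a URL, ignoring any query string or fragment."""
    path = urlparse(url).path  # the path part only, without ?query or #fragment
    name = os.path.basename(unquote(path))  # decode %20-style escapes first
    return name or default  # fall back when the URL ends with a slash


def save_chunks(chunks, dest_path):
    """Write an iterable of byte chunks to dest_path; return bytes written."""
    written = 0
    with open(dest_path, 'wb') as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written
```

In the loop from this lesson, these could be combined as save_chunks(img_response.iter_content(1024), os.path.join('images', filename_from_url(full_src))).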

Summary and Exercises

In this lesson, you learned how to extract and save images from web pages using Python and the BeautifulSoup library. You learned how to make web requests, parse HTML content, locate image elements, extract image URLs, and save images to the local file system.

Now it's time to practice what you've learned in the exercises. Good luck!
