Welcome to the lesson on extracting and saving images from web pages. In this lesson, you will learn how to use Python and the `BeautifulSoup` library to scrape images from web pages and save them locally. By the end of this lesson, you will have a solid understanding of the entire process, from making web requests to locating image elements and saving the images.
We start by fetching the HTML content of the website we want to scrape. In this case, we'll use https://books.toscrape.com/.
First, import the necessary libraries, make an HTTP GET request to the website, and parse the HTML content using `BeautifulSoup`.
```python
import requests
from bs4 import BeautifulSoup

url = 'https://books.toscrape.com/'
response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')
```
In this example, we fetch and parse the HTML content of the Books website.
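The request above assumes the server answers promptly and successfully. As a minimal sketch, the fetch-and-parse step could be wrapped in a helper that sets a timeout and raises an exception on HTTP error statuses. The name `fetch_soup` and the 10-second default are illustrative assumptions, not part of the lesson's code.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Fetch a page and return its parsed HTML (hypothetical helper)."""
    # timeout stops the call from waiting indefinitely on a slow server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return BeautifulSoup(response.text, 'html.parser')
```

Calling `fetch_soup('https://books.toscrape.com/')` would return the same kind of `soup` object used in the rest of the lesson.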
With the parsed HTML content, use `BeautifulSoup` to locate image elements and extract their URLs from the `src` attribute:
```python
images = soup.find_all('img')
image_urls = [img['src'] for img in images]
```
We now have a list of image URLs extracted from the web page.
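The `src` values collected this way may be relative paths rather than complete URLs. One way to normalize them, sketched here with the standard library's `urljoin`, is to resolve each value against the URL of the page it came from. The function name `resolve_image_urls` and the sample `src` values are hypothetical and used only for illustration.

```python
from urllib.parse import urljoin


def resolve_image_urls(page_url, srcs):
    """Turn each src value into an absolute URL relative to page_url."""
    # urljoin resolves relative paths and leaves absolute URLs unchanged
    return [urljoin(page_url, src) for src in srcs]


page_url = 'https://books.toscrape.com/'
# Hypothetical src values: one relative, one already absolute
sample_srcs = ['media/cache/cover.jpg', 'https://example.com/logo.png']
absolute_urls = resolve_image_urls(page_url, sample_srcs)
print(absolute_urls)
```

Unlike simple string concatenation, `urljoin` also handles paths such as `../media/a.jpg` when the page itself is not at the site root.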
Finally, we will download and save the extracted images to the local file system.
Let's first ensure the `images` directory exists, creating it if it doesn't, using the `makedirs` function from the `os` module:
```python
import os
os.makedirs('images', exist_ok=True)
```
Next, we can iterate over the image URLs, send a request to each URL, and save the images:
```python
for src in image_urls:
    # Build an absolute URL when src is a relative path
    full_src = f"https://books.toscrape.com/{src}" if not src.startswith('http') else src
    img_response = requests.get(full_src, stream=True)
    if img_response.status_code == 200:
        img_name = os.path.basename(src)  # Extract the image name from the URL
        with open(f"images/{img_name}", 'wb') as f:
            # Write the body in 1024-byte chunks
            for chunk in img_response.iter_content(1024):
                f.write(chunk)
        print(f"Saved {img_name}")
```
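The loop takes the file name directly from `src` and writes the chunks inline. As a sketch of how those two steps could be factored out, the helpers below derive a file name from a URL while ignoring any query string, and write an iterable of byte chunks to disk. Both function names and the `'image'` fallback are assumptions made for illustration; they use only the standard library.

```python
import os
from urllib.parse import urlparse


def filename_from_url(url, default='image'):
    """Return the last path component of url, ignoring any query string."""
    name = os.path.basename(urlparse(url).path)
    # Fall back to a default when the URL path ends with a slash
    return name or default


def write_chunks(chunks, path):
    """Write an iterable of byte chunks to path in binary mode.

    Returns the total number of bytes written.
    """
    total = 0
    with open(path, 'wb') as f:
        for chunk in chunks:
            f.write(chunk)
            total += len(chunk)
    return total
```

Inside the loop, these could be combined as `write_chunks(img_response.iter_content(1024), os.path.join('images', filename_from_url(full_src)))`.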
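After running the download loop, all the images are saved in the `images` directory. Let's understand the code step by step:

- For each `src`, we build the full image URL by prepending `https://books.toscrape.com/` unless `src` already starts with `http`.
- We request the image with `stream=True`, so the response body can be read in chunks instead of being loaded all at once.
- If the response status code is 200, we take the file name from the last path component of `src` and save the image in the `images` directory.

Notice that we save the image in binary mode (`'wb'`) and write the image content in chunks of 1024 bytes, which is more memory-efficient for large files.

In this lesson, you learned how to extract and save images from web pages using Python and the `BeautifulSoup` library. You learned how to make web requests, parse HTML content, locate image elements, extract image URLs, and save images to the local file system.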
Now it's time to practice what you've learned in the exercises. Good luck!