Lesson 3
Structured Data Extraction and Storage with Python
Topic Overview

Hello and welcome! In this lesson, we'll be focusing on Structured Data Extraction and Storage. Specifically, we'll use Python, along with BeautifulSoup and Pandas, to scrape data from web pages and store it in a CSV file. This process involves retrieving HTML content, parsing it to extract data, handling pagination, and finally saving the structured data.

Introduction to CSV Files

When scraping data from web pages, it's essential to store the extracted data in a structured format for further analysis. One common way to store structured data is by using a CSV (Comma-Separated Values) file. CSV files are easy to create, read, and share, making them a popular choice for storing tabular data. Here is an example of a CSV file:

```csv
actor,character,movie
Tom Hanks,Forrest Gump,Forrest Gump
Leonardo DiCaprio,Dominick Cobb,Inception
```

Pandas Library

Pandas is a powerful library in Python for data manipulation and analysis. It provides data structures like DataFrame and tools for reading and writing data in various formats, including CSV files. By using Pandas, we can easily store structured data in a CSV file.

Here is an example of how to create a DataFrame and save it to a CSV file using Pandas:

```python
import pandas as pd

data = {
    'actor': ['Tom Hanks', 'Leonardo DiCaprio'],
    'character': ['Forrest Gump', 'Dominick Cobb'],
    'movie': ['Forrest Gump', 'Inception']
}

df = pd.DataFrame(data)
df.to_csv('actors.csv', index=False)
```

After running this code, a CSV file named actors.csv will be created with the following content:

```csv
actor,character,movie
Tom Hanks,Forrest Gump,Forrest Gump
Leonardo DiCaprio,Dominick Cobb,Inception
```

Now that we have an understanding of CSV files and the Pandas library, let's move on to web scraping and data extraction.
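Before putting everything together, it helps to see the extraction step in isolation. The sketch below parses a small inline HTML snippet that mirrors the markup used by quotes.toscrape.com (a div with class quote containing a span.text, a small.author, and a.tag elements), so no network request is needed:

```python
from bs4 import BeautifulSoup

# A small HTML snippet mirroring the structure of quotes.toscrape.com:
# each quote lives in a div.quote with span.text, small.author, and a.tag children.
html = """
<div class="quote">
  <span class="text">"Be yourself; everyone else is already taken."</span>
  <small class="author">Oscar Wilde</small>
  <a class="tag">inspirational</a>
  <a class="tag">life</a>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
quote = soup.find('div', class_='quote')

# Pull out the three pieces of data we want to store.
text = quote.find('span', class_='text').text
author = quote.find('small', class_='author').text
tags = [tag.text for tag in quote.find_all('a', class_='tag')]

print(author, tags)  # Oscar Wilde ['inspirational', 'life']
```

The same find/find_all calls work identically on a full page fetched with requests; the only difference is where the HTML string comes from.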

Storing Scraped Data in a CSV File
```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

def extract_to_csv(base_url, start_page, filename):
    all_quotes = []
    current_page = start_page

    while current_page:
        response = requests.get(f"{base_url}{current_page}")
        soup = BeautifulSoup(response.text, 'html.parser')

        quotes = soup.find_all('div', class_='quote')
        for quote in quotes:
            text = quote.find('span', class_='text').text
            author = quote.find('small', class_='author').text
            tags = [tag.text for tag in quote.find_all('a', class_='tag')]
            all_quotes.append({"text": text, "author": author, "tags": tags})

        next_link = soup.find('li', class_='next')
        current_page = next_link.find('a')['href'] if next_link else None

    df = pd.DataFrame(all_quotes)
    df.to_csv(filename, index=False)
    print(f"Data saved to {filename}")

base_url = 'http://quotes.toscrape.com'
start_page = '/page/1/'
filename = 'quotes.csv'
extract_to_csv(base_url, start_page, filename)
```

In this code:

  • We define the extract_to_csv function to handle the entire process.
    • all_quotes collects the quotes from every page.
    • We loop through each page, extract quotes in the format {"text": text, "author": author, "tags": tags}, and append them to all_quotes.
    • The loop is controlled by the current_page variable, which is updated to the next page's URL until there are no more pages.
    • The next page URL is extracted from the li element with the class next.
  • pd.DataFrame(all_quotes) creates a DataFrame from the collected quotes.
  • df.to_csv(filename, index=False) saves the DataFrame to a CSV file.
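The pagination logic above can be sketched on its own, using two hypothetical HTML snippets that mirror the site's pager markup (a li with class next on every page except the last):

```python
from bs4 import BeautifulSoup

def next_page(html):
    """Return the next page's relative URL, or None on the last page."""
    soup = BeautifulSoup(html, 'html.parser')
    next_link = soup.find('li', class_='next')
    return next_link.find('a')['href'] if next_link else None

# A page with a "Next" button...
middle_page = '<ul class="pager"><li class="next"><a href="/page/2/">Next</a></li></ul>'
# ...and a final page without one.
last_page = '<ul class="pager"></ul>'

print(next_page(middle_page))  # /page/2/
print(next_page(last_page))    # None
```

Because next_page returns None when the li.next element is missing, the while current_page loop terminates naturally on the last page.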

The output of the above code will be the quotes.csv file containing the extracted data in a structured format.
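One detail worth knowing: because the tags field holds a Python list, to_csv writes its string representation, so reading the file back with read_csv yields strings like "['love', 'life']" rather than lists. A minimal sketch of recovering the lists with ast.literal_eval, using a small frame built the same way extract_to_csv builds its data:

```python
import ast
import pandas as pd

# Build and save a small frame shaped like the scraped data.
df = pd.DataFrame([
    {"text": "An example quote.", "author": "Jane Doe", "tags": ["example", "demo"]},
])
df.to_csv('quotes.csv', index=False)

# Reading it back, the tags column contains strings, not lists.
loaded = pd.read_csv('quotes.csv')
print(type(loaded.loc[0, 'tags']))  # <class 'str'>

# ast.literal_eval safely parses the string back into a Python list.
loaded['tags'] = loaded['tags'].apply(ast.literal_eval)
print(loaded.loc[0, 'tags'])  # ['example', 'demo']
```

An alternative is to join the tags into a single delimited string (e.g. "example;demo") before saving, which keeps the CSV friendlier to other tools.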

Lesson Summary

In this lesson, we covered the process of extracting structured data from web pages and storing it in a CSV file. We used Python, BeautifulSoup, and Pandas to scrape quotes from a website and save them in a CSV file.

Make sure to practice this on your own and explore other web scraping projects to enhance your skills. Happy coding!
