Hello and welcome! In this lesson, we'll be focusing on Structured Data Extraction and Storage. Specifically, we'll use Python, along with BeautifulSoup and Pandas, to scrape data from web pages and store it in a CSV file. This process involves retrieving HTML content, parsing it to extract data, handling pagination, and finally saving the structured data.
When scraping data from web pages, it's essential to store the extracted data in a structured format for further analysis. One common way to store structured data is by using a CSV (Comma-Separated Values) file. CSV files are easy to create, read, and share, making them a popular choice for storing tabular data. Here is an example of a CSV file:
```csv
actor,character,movie
Tom Hanks,Forrest Gump,Forrest Gump
Leonardo DiCaprio,Dominick Cobb,Inception
```
Pandas is a powerful library in Python for data manipulation and analysis. It provides data structures like `DataFrame` and tools for reading and writing data in various formats, including CSV files. By using Pandas, we can easily store structured data in a CSV file.

Here is an example of how to create a `DataFrame` and save it to a CSV file using Pandas:
```python
import pandas as pd

data = {
    'actor': ['Tom Hanks', 'Leonardo DiCaprio'],
    'character': ['Forrest Gump', 'Dominick Cobb'],
    'movie': ['Forrest Gump', 'Inception']
}

df = pd.DataFrame(data)
df.to_csv('actors.csv', index=False)
```
After running this code, a CSV file named `actors.csv` will be created with the following content:
```csv
actor,character,movie
Tom Hanks,Forrest Gump,Forrest Gump
Leonardo DiCaprio,Dominick Cobb,Inception
```
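Pandas can read the file back just as easily, which is a quick way to verify what was written. A minimal sketch, assuming `actors.csv` is in the working directory:

```python
import pandas as pd

# Load the CSV back into a DataFrame; the first row becomes the column names
df = pd.read_csv('actors.csv')
print(df)
```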
Now that we have an understanding of CSV files and the Pandas library, let's move on to web scraping and data extraction.
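Before looking at the full script, it helps to see the two scraping steps in isolation: `requests` fetches the raw HTML, and BeautifulSoup parses it so we can select elements by tag and class. A minimal sketch against the quotes site used below (the class names match that site's markup):

```python
import requests
from bs4 import BeautifulSoup

# Fetch a single page of HTML
response = requests.get('http://quotes.toscrape.com/page/1/')

# Parse it so elements can be selected by tag and class
soup = BeautifulSoup(response.text, 'html.parser')

# On this site, each quote lives in a <div class="quote">
first_quote = soup.find('div', class_='quote')
print(first_quote.find('span', class_='text').text)
```

With those pieces in hand, the full script below scrapes every page of quotes, follows the pagination links, and saves the results to a CSV file: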
```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

def extract_to_csv(base_url, start_page, filename):
    all_quotes = []
    current_page = start_page

    while current_page:
        # Fetch and parse the current page
        response = requests.get(f"{base_url}{current_page}")
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract the text, author, and tags from each quote on the page
        quotes = soup.find_all('div', class_='quote')
        for quote in quotes:
            text = quote.find('span', class_='text').text
            author = quote.find('small', class_='author').text
            tags = [tag.text for tag in quote.find_all('a', class_='tag')]
            all_quotes.append({"text": text, "author": author, "tags": tags})

        # Follow the "Next" link if there is one; otherwise stop the loop
        next_link = soup.find('li', class_='next')
        current_page = next_link.find('a')['href'] if next_link else None

    # Store the collected quotes in a DataFrame and save them as CSV
    df = pd.DataFrame(all_quotes)
    df.to_csv(filename, index=False)
    print(f"Data saved to {filename}")

base_url = 'http://quotes.toscrape.com'
start_page = '/page/1/'
filename = 'quotes.csv'
extract_to_csv(base_url, start_page, filename)
```
In this code:

- We define the `extract_to_csv` function to handle the entire process; `all_quotes` collects the quotes from all the pages.
- We loop through each page, extract quotes in the format `{"text": text, "author": author, "tags": tags}`, and append them to `all_quotes`.
- The loop is controlled by the `current_page` variable, which is updated to the next page URL until there are no more pages (a more defensive version of this fetch step is sketched after this list).
- The next page URL is extracted from the `li` element with the class `next`.
- `pd.DataFrame(all_quotes)` creates a DataFrame, and `df.to_csv(filename, index=False)` saves the DataFrame to a CSV file.
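As a side note, production scrapers usually fail loudly on HTTP errors and pause between requests to stay polite. Here is a hedged sketch of how the fetch step could be hardened; `fetch_page` is a hypothetical helper, and the timeout, status check, and delay are additions beyond the lesson's solution:

```python
import time
import requests

def fetch_page(base_url, page_path):
    # Hypothetical helper, not part of the lesson's solution
    response = requests.get(f"{base_url}{page_path}", timeout=10)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    time.sleep(1)                # brief pause so we don't hammer the server
    return response.text
```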
The output of the above code will be the `quotes.csv` file containing the extracted data in a structured format.
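One caveat when reading the file back: `to_csv` writes the `tags` list as its string representation (for example, `"['life', 'love']"`), so it needs to be parsed when loaded. A minimal sketch, assuming `quotes.csv` was produced by the script above:

```python
import ast

import pandas as pd

df = pd.read_csv('quotes.csv')

# The 'tags' column was serialized as a string; convert it back to a list
df['tags'] = df['tags'].apply(ast.literal_eval)
print(df.head())
```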
In this lesson, we covered the process of extracting structured data from web pages and storing it in a CSV file. We used Python, BeautifulSoup, and Pandas to scrape quotes from a website and save them in a CSV file.
Make sure to practice this on your own and explore other web scraping projects to enhance your skills. Happy coding!