Hello and welcome to this new lesson! Today, we will build upon your existing knowledge of Python's requests library and learn more about the responses we get when making HTTP requests. Specifically, we'll be inspecting response headers.
Whenever you make an HTTP request, the server does not only send back the requested content, but also some metadata related to that content. This metadata is conveyed through HTTP headers, which come as key-value pairs.
In simpler terms, if HTTP were a mailing system, headers would be like the information on the outside of the envelope: who it's from, where it's going, the date it was sent, and so on. HTTP headers carry information such as the type of content being sent, how to decode it, when it was last modified, and more!
Let's use Python's `requests` library to see some of these headers in action, using the solution code as an example.
```python
import requests

url = 'http://quotes.toscrape.com'  # We'll scrape quotes from this webpage
response = requests.get(url)
```
First, we make an HTTP GET request to our target URL, which gives us a `Response` object. One of the properties of this object is `headers`, which is a dictionary-like object of all response headers.
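Because HTTP header names are case-insensitive, `requests` stores them in a `CaseInsensitiveDict`, so a lookup works however the name is capitalised. The sketch below builds such a dictionary by hand so it runs without a network call; the sample values are illustrative assumptions, not a guaranteed server response.

```python
from requests.structures import CaseInsensitiveDict

# Hand-built sample (assumed values); a live response.headers object
# is an instance of this same class.
headers = CaseInsensitiveDict({
    'Content-Type': 'text/html; charset=utf-8',
    'Content-Length': '11054',
})

# Indexing works regardless of the capitalisation of the header name.
content_type = headers['content-type']

# .get() returns None (or a chosen default) when a header is absent,
# instead of raising KeyError.
missing = headers.get('ETag')
length = headers.get('CONTENT-LENGTH', '0')

print(content_type)
print(missing)
print(length)
```

With a live response, the same lookups would be written as `response.headers['content-type']` or `response.headers.get('ETag')`.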
We can print these headers like this:
```python
if response.ok:
    print("Response headers:")
    for header, value in response.headers.items():
        print(f'{header}: {value}')
```
This code prints each header along with its corresponding value. Let's run this and see what we get!
The output will look something like this (the exact headers and values depend on the server and on when you run the code):
```
Response headers:
Date: Tue, 07 May 2024 18:28:19 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 11054
Connection: keep-alive
```
This output summarises key information from the response headers, including when the response was sent (`Date`), what the content type is (`Content-Type`), how big the content is in bytes (`Content-Length`), and the connection status (`Connection`). Such details are crucial for understanding how to handle the received data in web scraping or API calls.
When you run the previous code, you'll probably come across many headers. Here are a few important ones which come up frequently:
- `Server`: The software used by the origin server.
- `Date`: The date and time when the message was sent.
- `Content-Type`: The MIME type of the returned content. This could be `text/html`, `application/json`, `image/jpeg`, and so on. This tells the client what the content is and how to open it.
- `Content-Length`: The size, in bytes, of the returned content.
- `Connection`: Control options for the current connection. For example, `keep-alive` indicates that the connection stays open for further requests.
These headers provide additional insights into the server and the response content, and they can be quite useful in some cases!
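Header values always arrive as strings, so it often helps to convert them into Python values before using them. The following is a minimal sketch using only the standard library; `describe_headers` is a hypothetical helper name, and the sample dictionary simply copies the values from the example output in this lesson.

```python
from email.message import Message
from email.utils import parsedate_to_datetime


def describe_headers(headers):
    """Convert a few common header strings into Python values.

    `headers` can be any mapping with a .get() method, such as
    response.headers or a plain dict. (Hypothetical helper.)
    """
    info = {'media_type': None, 'charset': None, 'length': None, 'sent_at': None}

    content_type = headers.get('Content-Type')
    if content_type:
        # Message is used here only as a parser for "type/subtype; param=value".
        msg = Message()
        msg['Content-Type'] = content_type
        info['media_type'] = msg.get_content_type()
        info['charset'] = msg.get_content_charset()

    length = headers.get('Content-Length')
    if length is not None and length.isdigit():
        info['length'] = int(length)

    date = headers.get('Date')
    if date:
        try:
            info['sent_at'] = parsedate_to_datetime(date)
        except (TypeError, ValueError):
            pass  # Leave sent_at as None if the date cannot be parsed.

    return info


# Sample values copied from the example output in this lesson.
sample = {
    'Date': 'Tue, 07 May 2024 18:28:19 GMT',
    'Content-Type': 'text/html; charset=utf-8',
    'Content-Length': '11054',
    'Connection': 'keep-alive',
}
info = describe_headers(sample)
print(info['media_type'], info['charset'], info['length'])
# prints: text/html utf-8 11054
```

With a live response, the call would be `describe_headers(response.headers)`.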
Now, why is all this important for web scraping? Let's dig a bit deeper.
As a web scraper, your main goal is to extract useful data from web pages. However, scraping is not just about making requests and parsing HTML. You also need to ensure that your scraper behaves well and follows the rules set by the server. The server's responses, including headers, are a critical source of feedback for your scraper, containing valuable information about what the server allows or expects you to do.
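One concrete case of a header acting as feedback is `Retry-After`, which a server may send (typically alongside a 429 or 503 status) to say how long a client should wait before trying again. Its value is either a number of seconds or an HTTP date. The sketch below is an illustration only: `retry_delay_seconds` is a hypothetical helper, the default delay is an arbitrary assumption, and nothing here implies that the site used in this lesson sends this header.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_delay_seconds(headers, default=1.0, now=None):
    """Return how many seconds to wait before the next request.

    Hypothetical helper: falls back to `default` when Retry-After
    is missing or cannot be parsed.
    """
    value = headers.get('Retry-After')
    if value is None:
        return default

    value = value.strip()
    if value.isdigit():
        # Form 1: a delay given directly in seconds.
        return float(value)

    try:
        # Form 2: an HTTP date naming the earliest retry time.
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default

    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    # Never return a negative delay if the date is already in the past.
    return max(0.0, (when - now).total_seconds())


print(retry_delay_seconds({'Retry-After': '120'}))  # prints: 120.0
print(retry_delay_seconds({}))                      # prints: 1.0
```

A scraper could pass `response.headers` to this function and call `time.sleep()` with the result before sending its next request.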
For instance, an important header in web scraping is `Content-Type`, which can help you determine the format of the returned content. If the `Content-Type` is `application/json` (possibly followed by parameters such as a charset), you can use `response.json()` to parse the body into Python objects such as dictionaries and lists. Knowing the format up front shapes how you structure your web scraping code.
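The idea of branching on `Content-Type` can be sketched as a small function. This is an illustrative sketch, not the lesson's solution code: `parse_body` is a hypothetical name, and it uses `json.loads` from the standard library in place of `response.json()` so that it runs without a network call.

```python
import json


def parse_body(content_type, body_text):
    """Parse body_text according to the media type in content_type.

    Hypothetical helper: returns parsed JSON for application/json,
    and the unchanged text for everything else.
    """
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    media_type = content_type.split(';', 1)[0].strip().lower()
    if media_type == 'application/json':
        return json.loads(body_text)
    return body_text  # e.g. HTML, to be handed to an HTML parser


print(parse_body('application/json; charset=utf-8', '{"author": "Jane Austen"}'))
# prints: {'author': 'Jane Austen'}
print(parse_body('text/html; charset=utf-8', '<p>hello</p>'))
# prints: <p>hello</p>
```

With a live response, this could be called as `parse_body(response.headers.get('Content-Type', ''), response.text)`.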
Well, our journey for this lesson stops here. We learned about HTTP response headers and how to inspect them using Python's `requests` library. Keep practicing and experimenting with different websites to further strengthen your understanding of this important aspect of HTTP!
Happy coding!