Lesson 2

Navigating the Web: Mastering HTTP Status Codes with Python Requests

Introduction

Welcome to today's lesson: Handling HTTP Status Codes with Python's requests library. HTTP status codes are fundamental to understanding the response from a web server, and they play an important role when we request data from the web. Whether performing an API call or implementing a web scraper, correct handling of status codes ensures the resilience and stability of your code. By the end of this lesson, you will have a firm grasp of what HTTP status codes are, and how to handle them using Python's requests module.

HTTP Status Codes: An Overview

HTTP status codes are three-digit numbers that the server sends back to the client (your script, in this case) to indicate the outcome of the request. Status codes fall into five classes, identified by their first digit:

  • 1xx: Informational
  • 2xx: Success, e.g., 200 OK (the most common)
  • 3xx: Redirection, e.g., 301 Moved Permanently
  • 4xx: Client errors, e.g., 404 Not Found and 403 Forbidden
  • 5xx: Server errors, e.g., 500 Internal Server Error
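
Because the class is determined by the first digit, it can be derived from the code with integer division. The following is a minimal sketch of that idea. The function name status_class and the "Unknown" fallback are illustrative choices, not part of any library:

```python
def status_class(status_code: int) -> str:
    """Return the class name for a three-digit HTTP status code."""
    classes = {
        1: "Informational",
        2: "Success",
        3: "Redirection",
        4: "Client error",
        5: "Server error",
    }
    # Integer division by 100 keeps only the leading digit, e.g. 404 // 100 == 4.
    return classes.get(status_code // 100, "Unknown")


for code in (200, 301, 404, 500):
    print(code, status_class(code))
```

With a real response, the same helper could be called as status_class(response.status_code).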

Though there are many HTTP status codes, here are some common ones that you might come across when scraping the web:

  • 200 OK: The request was successful, and the server returned the requested resource.
  • 301 Moved Permanently: The requested URL has moved permanently, and the new URL is provided in the Location header of the response.
  • 403 Forbidden: The client doesn't have permission to access the requested URL.
  • 404 Not Found: The server could not find the requested URL.
  • 500 Internal Server Error: The server encountered an internal error and was unable to complete the request.
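
If you want to look up the standard name of a code without hard-coding it, Python's standard library includes the http.HTTPStatus enum. The sketch below uses it to print the names of the codes listed above. The helper name reason_phrase and its fallback text are assumptions made for illustration:

```python
from http import HTTPStatus


def reason_phrase(status_code: int) -> str:
    """Return the standard reason phrase for a status code."""
    try:
        return HTTPStatus(status_code).phrase
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not define.
        return "Unknown status"


for code in (200, 301, 403, 404, 500):
    print(code, reason_phrase(code))
```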

Understanding and handling these status codes when we write our scraping code will allow us to create more robust and effective web scraping solutions.

Python's `requests` and Status Codes

In Python, we can use the popular requests library to send HTTP requests. Upon receiving a response from the server, requests provides us with a Response object, which contains the server's response to our request.

One attribute of the Response object is status_code, which holds the HTTP status code that the server returned. If the server successfully processed our request, status_code will typically be 200. If the resource we requested wasn't found on the server, status_code will be 404.

Let's look at how we can make a GET request to a server and print the status code of the response:

```python
import requests

response = requests.get('http://example.com')
print(response.status_code)
```

This will print 200, meaning that our request was successful.

The next example checks whether the status code of the HTTP response is 404 and prints an appropriate message based on the result:

```python
import requests

# Attempt to fetch webpage content
url = 'http://quotes.toscrape.com/invalid'
response = requests.get(url)

if response.status_code == 404:
    print("The requested page was not found.")
else:
    print("Content fetched successfully!")
```

The output of the above code will be:

```text
The requested page was not found.
```

This output demonstrates how we can handle different HTTP status codes to interpret the server's response more effectively. It allows us to execute conditional code based on the outcome of our HTTP request, making our applications more robust and user-friendly.

Now, let's break down the code and understand it in detail. The requests.get(url) function sends a GET request to the specified URL. The server will then send back a response, which is stored in the response variable.

The if response.status_code == 404: line checks to see if the status code in the HTTP response is 404, which signifies that the requested resource was not found on the server.

If the status code is indeed 404, then the code block under the if statement will be executed, and the message "The requested page was not found." will be printed.

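However, if the status code is anything other than 404, the code block under the else statement will be executed, and the message "Content fetched successfully!" will be printed. Keep in mind that this simple else branch also runs for other error codes, such as 403 or 500, so a real script should check for success explicitly rather than assume it.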
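
A stricter check can treat only 200 as success and report other codes separately. The sketch below is one possible way to do that. The function name interpret_status and the wording of the extra messages are hypothetical and chosen for illustration:

```python
def interpret_status(status_code: int) -> str:
    """Map an HTTP status code to a short human-readable message."""
    if status_code == 200:
        return "Content fetched successfully!"
    if status_code == 404:
        return "The requested page was not found."
    if status_code == 403:
        return "Access to the requested page is forbidden."
    if 500 <= status_code < 600:
        # Any 5xx code signals a problem on the server side.
        return "The server encountered an error; try again later."
    # Anything else is reported rather than silently treated as success.
    return f"Unexpected status code: {status_code}"


for code in (200, 403, 404, 500, 418):
    print(interpret_status(code))
```

With a real response, you could call print(interpret_status(response.status_code)) in place of the if/else block.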

Setting Timeouts for Requests

When making HTTP requests, it's important to set a timeout. By default, requests does not time out at all, so a server that never responds can leave your application hanging indefinitely. The timeout parameter lets you specify a limit in seconds. If the server does not respond within that period, a requests.exceptions.Timeout exception is raised.

Here's how you can set a timeout for a GET request:

```python
import requests

try:
    response = requests.get('http://www.google.com:81/', timeout=4)  # Timeout set to 4 seconds
    print(response.status_code)
except requests.exceptions.Timeout:
    print("The request timed out.")
```

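In this example, we set a timeout of 4 seconds. If the server does not respond within this time, the except block runs and "The request timed out." is printed. Port 81 on www.google.com is used because it typically does not answer, which makes the timeout easy to observe.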

Setting timeouts is particularly useful for web scraping and API requests, where server responsiveness can vary. It ensures that your application remains responsive and can handle situations where the server takes too long to reply.
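
The timeout parameter also accepts a (connect, read) tuple, which sets separate limits for establishing the connection and for waiting on the server's response. The sketch below wraps that in a small helper. The function name fetch_status, its default values, and the choice to return None on a timeout are assumptions made for illustration:

```python
import requests


def fetch_status(url, connect_timeout=3.05, read_timeout=10):
    """Return the HTTP status code for url, or None if the request times out."""
    try:
        # The tuple gives requests one limit for connecting and another
        # for waiting on data from the server.
        response = requests.get(url, timeout=(connect_timeout, read_timeout))
    except requests.exceptions.Timeout:
        # Raised for both connection timeouts and read timeouts.
        return None
    return response.status_code
```

A caller could then write status = fetch_status('http://example.com') and treat a None result as "no answer in time" before looking at the status code.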

Lesson Summary and Practice

Fantastic job! We have successfully covered the basics of HTTP status codes and how we can handle them using Python's requests library. Properly handling these status codes is crucial for ensuring the stability and efficiency of your web scraping code.

Remember, knowledge is perfected through continuous practice. In the next exercise, you will write your own Python code to fetch HTTP status codes from different web addresses. This will not only put your understanding to the test but also make you comfortable with handling HTTP status codes in real-world applications. Happy coding!
