Hello! Today, we'll explore Probability Distributions, a key concept in statistics and machine learning. By the end of this lesson, you'll know what probability distributions are, why they're essential, and how to work with them in Python.
Probability distributions help us understand how data behaves and the likelihood of different outcomes. We use them in everyday tasks like predicting weather, recommending movies, and much more. Let's dive in and see how they work!
A Probability Distribution describes how values of a random variable are distributed. It tells us the chances of different outcomes. Imagine rolling a six-sided die. Each number (1 to 6) has an equal chance of appearing. That’s an example of a probability distribution!
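The die example can be checked with a short simulation. The sketch below assumes NumPy's default_rng generator with an arbitrary seed and an arbitrary choice of 60,000 rolls; it compares the theoretical probability of each face (1/6) with the fraction observed in the simulated rolls.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded so the run is reproducible

# Theoretical distribution of a fair six-sided die: each face has probability 1/6
faces = np.arange(1, 7)
theoretical = np.full(6, 1 / 6)

# Simulate 60,000 rolls and estimate each face's probability from the counts
rolls = rng.integers(low=1, high=7, size=60_000)  # high is exclusive
counts = np.bincount(rolls, minlength=7)[1:]      # drop the unused slot for 0
empirical = counts / rolls.size

for face, p_theory, p_est in zip(faces, theoretical, empirical):
    print(f"face {face}: theoretical {p_theory:.3f}, simulated {p_est:.3f}")
```

The simulated fractions should sit close to 1/6 and add up to 1, which is what it means for the six probabilities to form a distribution.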
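Probability distributions are crucial because they let us quantify uncertainty: once we know how a variable is distributed, we can compute how likely any outcome, or range of outcomes, is.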
The Normal Distribution (or Gaussian Distribution) is one of the most important probability distributions. Many natural phenomena follow this distribution, like heights, IQ scores, and measurement errors. The Normal Distribution is a bell-shaped curve symmetrical around its mean (average) value.
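The Normal Distribution is defined by two parameters: the mean ($\mu$), which sets the center of the curve, and the standard deviation ($\sigma$), which sets how widely the values spread around that center.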
For example, let's say we measure the heights of adult men in a town. The mean height is 70 inches, and the standard deviation is 3 inches. Most men will be around 70 inches tall, with fewer being much shorter or taller.
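The height example can be simulated to see what "most" means in practice. This is a minimal sketch, assuming NumPy's default_rng generator, an arbitrary seed, and a hypothetical town of 100,000 men; the cut-offs of 67 to 73 inches (one standard deviation either side of the mean) and 76 inches are chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded so the run is reproducible

# Hypothetical heights drawn from a normal distribution: mean 70 in, std 3 in
heights = rng.normal(loc=70, scale=3, size=100_000)

# Fraction of men within one standard deviation of the mean
near_mean = np.mean((heights >= 67) & (heights <= 73))

# Fraction of men more than two standard deviations above the mean
very_tall = np.mean(heights > 76)

print(f"Between 67 and 73 inches: {near_mean:.3f}")
print(f"Taller than 76 inches:    {very_tall:.3f}")
```

Roughly two thirds of the simulated heights fall within 3 inches of the mean, while only a few percent exceed 76 inches.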
We can generate a sample of data that follows a normal distribution in Python using the numpy library. Let's create a sample with 1,000 data points, where the mean ($\mu$) is 0 and the standard deviation ($\sigma$) is 1.
```python
import numpy as np
import matplotlib.pyplot as plt

mu = 0     # mean
sigma = 1  # standard deviation
sample = np.random.normal(mu, sigma, 1000)  # generate a sample of 1000 data points

# Plot the sample; density=True rescales the bars so their total area is 1
plt.hist(sample, bins=30, density=True, alpha=0.6, color='g')
plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Histogram of Normal Distribution Sample')
plt.show()
```
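In this example, np.random.normal(mu, sigma, 1000) draws 1,000 values from a normal distribution with mean mu and standard deviation sigma, and plt.hist plots them as a histogram with 30 bins. Passing density=True rescales the bars so that their total area is 1.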
After we run this code, we will see the following picture:
If we increase the number of data points and the number of bins (which is analogous to collecting more data), we get a plot whose shape is close to the iconic bell curve:
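The effect of collecting more data can also be measured without a plot. The sketch below is one possible check, not the lesson's own method: a hypothetical helper, max_gap_to_bell_curve, builds a density histogram with np.histogram and reports the largest difference between a bar height and the normal density from scipy.stats.norm.pdf at the bar's midpoint. The sample sizes, bin counts, and seed are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import norm

def max_gap_to_bell_curve(n_points, n_bins, rng):
    """Largest gap between histogram bar heights and the normal density."""
    sample = rng.normal(0, 1, n_points)
    heights, edges = np.histogram(sample, bins=n_bins, density=True)
    centers = (edges[:-1] + edges[1:]) / 2  # midpoint of each bin
    return np.max(np.abs(heights - norm.pdf(centers, 0, 1)))

rng = np.random.default_rng(seed=0)  # seeded so the run is reproducible
gap_small = max_gap_to_bell_curve(1_000, 30, rng)
gap_large = max_gap_to_bell_curve(1_000_000, 100, rng)

print(f"1,000 points, 30 bins:      max gap {gap_small:.4f}")
print(f"1,000,000 points, 100 bins: max gap {gap_large:.4f}")
```

The gap for the larger sample should be much smaller, which is the numerical counterpart of the histogram looking smoother.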
The more data you analyze, the closer the distribution is to the curve defined as:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
This function is the probability density function (PDF) for the normal distribution. Let's explore it!
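To make the function concrete, it can be evaluated by hand and compared with a library implementation. The sketch below writes the normal density, 1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2)), as a hypothetical helper named normal_pdf and checks it against scipy.stats.norm.pdf at a few points chosen for illustration.

```python
import numpy as np
from scipy.stats import norm

def normal_pdf(x, mu, sigma):
    """Evaluate the normal density formula directly."""
    coefficient = 1 / (sigma * np.sqrt(2 * np.pi))
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return coefficient * np.exp(exponent)

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
by_hand = normal_pdf(x, mu=0, sigma=1)
by_scipy = norm.pdf(x, 0, 1)

print("formula:", by_hand)
print("scipy:  ", by_scipy)
print("match:  ", np.allclose(by_hand, by_scipy))
```

The two results agree, and the values are symmetric around the mean with the largest density at x = 0.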
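The Probability Density Function (PDF) describes the relative likelihood of a random variable taking a value in a very small interval around a given point. For a continuous variable, the probability of any single exact value is zero; probabilities come from the area under the PDF over an interval.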
Here's how to calculate and visualize the PDF for our normal distribution:
```python
from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt

mu = 0
sigma = 1

# Calculate PDF for a range of values
x = np.linspace(-3*sigma, 3*sigma, 1000)
pdf = norm.pdf(x, mu, sigma)

# Plot the PDF
plt.plot(x, pdf, 'b-', lw=2)
plt.xlabel('Value')
plt.ylabel('Probability Density')
plt.title('Probability Density Function (PDF) of Normal Distribution')
plt.show()
```
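In this example, np.linspace(-3*sigma, 3*sigma, 1000) creates 1,000 evenly spaced values covering three standard deviations on either side of the mean, and norm.pdf(x, mu, sigma) evaluates the normal PDF at each of them.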
The PDF shows how the probability density varies with different values. For a normal distribution, values close to the mean are more likely.
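A density only becomes a probability once it is integrated over an interval. The sketch below is a rough numerical illustration under assumed settings: a hypothetical helper, area_under_pdf, approximates the area under the standard normal PDF with the trapezoidal rule, and the intervals are chosen here for illustration. The last line shows that a density value can exceed 1, which a probability cannot.

```python
import numpy as np
from scipy.stats import norm

def area_under_pdf(a, b, n=10_001):
    """Trapezoidal approximation of the area under the standard normal PDF on [a, b]."""
    x = np.linspace(a, b, n)
    pdf = norm.pdf(x, 0, 1)
    dx = x[1] - x[0]
    return np.sum((pdf[:-1] + pdf[1:]) / 2) * dx

area_total = area_under_pdf(-3, 3)     # almost all of the probability
area_near = area_under_pdf(-0.5, 0.5)  # interval of width 1 at the mean
area_far = area_under_pdf(2, 3)        # interval of width 1 in the tail

# With a small standard deviation, the density at the mean is larger than 1
narrow_peak = norm.pdf(0, 0, 0.1)

print(f"area on [-3, 3]:     {area_total:.4f}")
print(f"area on [-0.5, 0.5]: {area_near:.4f}")
print(f"area on [2, 3]:      {area_far:.4f}")
print(f"peak density for sigma = 0.1: {narrow_peak:.3f}")
```

Two intervals of the same width carry very different probabilities depending on how close they are to the mean.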
The Cumulative Distribution Function (CDF) gives the probability that a random variable is less than or equal to a certain value. It accumulates the area under the PDF from the left, providing a full picture of the probability up to that point.
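The idea of accumulation can be sketched numerically. Assuming a fine grid that starts far in the left tail (here from -6 to 6 in steps of 0.001, values chosen for illustration), a running sum of density times step width should track scipy.stats.norm.cdf closely.

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-6, 6, 12_001)  # start far in the left tail, step 0.001
dx = x[1] - x[0]
pdf = norm.pdf(x, 0, 1)

# Running total of density * step width approximates the area to the left of each x
accumulated = np.cumsum(pdf) * dx
exact = norm.cdf(x, 0, 1)

max_gap = np.max(np.abs(accumulated - exact))
print(f"largest difference from norm.cdf: {max_gap:.6f}")
```

The difference shrinks further as the step width gets smaller, since the running sum is a simple approximation of the integral of the PDF.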
Here's how to calculate and visualize the CDF for our normal distribution:
```python
from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt

mu = 0
sigma = 1

# Range of values at which to evaluate the CDF
x = np.linspace(-3*sigma, 3*sigma, 1000)
# Calculate CDF for a range of values
cdf = norm.cdf(x, mu, sigma)

# Plot the CDF
plt.plot(x, cdf, 'r-', lw=2)
plt.xlabel('Value')
plt.ylabel('Cumulative Probability')
plt.title('Cumulative Distribution Function (CDF) of Normal Distribution')
plt.show()
```
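In this example, norm.cdf(x, mu, sigma) evaluates the normal CDF at each of the 1,000 points in x. The resulting curve rises from near 0 on the left to near 1 on the right and passes through 0.5 at the mean.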
Let's see how we can use the CDF to calculate probabilities.
Let's use the CDF to evaluate the probability of a random variable falling within a specific range. For our normal distribution example, we'll calculate the probability that a value is between -1 and 1.
The CDF gives us the cumulative probability that a random variable is less than or equal to a certain value. Essentially, it accumulates the probability from the left up to a given point.
For example, if we want to find the probability that our variable is less than or equal to 1 (i.e., $P(X \le 1)$), we look at the CDF value at $x = 1$. The CDF value at $x = 1$ tells us the total probability of all values up to 1.
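Reading a single value off the CDF can be cross-checked against simulated data. The sketch below takes 1 as the threshold (an assumed value for illustration) and compares scipy.stats.norm.cdf with the fraction of 100,000 simulated standard normal values that fall at or below it; the seed and sample size are arbitrary.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=0)  # seeded so the run is reproducible
sample = rng.normal(0, 1, 100_000)

threshold = 1.0
cdf_value = norm.cdf(threshold, 0, 1)     # P(X <= 1) from the CDF
empirical = np.mean(sample <= threshold)  # fraction of the sample at or below 1

print(f"CDF at {threshold}: {cdf_value:.4f}")
print(f"Fraction of sample <= {threshold}: {empirical:.4f}")
```

The two numbers should agree to about two decimal places, which is the sense in which the CDF gives the total probability of all values up to a point.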
Consider a normal distribution with a mean ($\mu$) of 0 and a standard deviation ($\sigma$) of 1. We want to calculate the probability that a value is between -1 and 1.
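To do this, we evaluate the CDF at the upper bound (1), evaluate it at the lower bound (-1), and subtract the second value from the first.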
This calculation can be broken down into the following formula:

$$P(-1 \le X \le 1) = \text{CDF}(1) - \text{CDF}(-1)$$
Here is how we can visualize it:
At $x = 1$ the CDF is roughly 0.84, and at $x = -1$ the CDF is roughly 0.16. So, $P(-1 \le X \le 1) \approx 0.84 - 0.16 = 0.68$.
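The subtraction of two CDF values can be wrapped in a small function. This is a minimal sketch: prob_between is a hypothetical helper name, and the second interval (-2 to 2) is added here only as a further illustration.

```python
from scipy.stats import norm

def prob_between(lower, upper, mu=0.0, sigma=1.0):
    """P(lower <= X <= upper) for a normal variable, as a difference of two CDF values."""
    return norm.cdf(upper, mu, sigma) - norm.cdf(lower, mu, sigma)

p_one_sigma = prob_between(-1, 1)
p_two_sigma = prob_between(-2, 2)

print(f"P(-1 <= X <= 1) = {p_one_sigma:.4f}")
print(f"P(-2 <= X <= 2) = {p_two_sigma:.4f}")
```

The first result is about 0.68, matching the rounded calculation above. The same helper works for other parameters: prob_between(67, 73, mu=70, sigma=3) gives the same probability for the height example, since 67 and 73 are one standard deviation either side of 70.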
Great job! You've learned about probability distributions, focusing on the normal distribution. We covered generating a normal distribution sample in Python, and understanding concepts like PDF and CDF.
Next, it's time to practice. In the hands-on practice, you'll generate your own distribution samples and calculate PDF and CDF values for different scenarios. This will solidify your understanding and prepare you for more complex tasks in machine learning. Let's get started!