Lesson 3
Probability Distributions
Lesson Introduction

Hello! Today, we'll explore Probability Distributions, a key concept in statistics and machine learning. By the end of this lesson, you'll know what probability distributions are, why they're essential, and how to work with them in Python.

Probability distributions help us understand how data behaves and the likelihood of different outcomes. We use them in everyday tasks like predicting weather, recommending movies, and much more. Let's dive in and see how they work!

Understanding Probability Distributions

A Probability Distribution describes how values of a random variable are distributed. It tells us the chances of different outcomes. Imagine rolling a six-sided die. Each number (1 to 6) has an equal chance of appearing. That’s an example of a probability distribution!
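The die example can be sketched in a few lines of Python. This is a minimal illustration that assumes a fair die; the seed, the number of simulated rolls, and the variable names are arbitrary choices made for this sketch.

```python
import numpy as np

# Assumption: a fair six-sided die, so every face has probability 1/6.
faces = np.arange(1, 7)
probabilities = np.full(6, 1 / 6)

# The probabilities of a valid distribution add up to 1.
total = probabilities.sum()

# Simulate rolls (seeded so the run is reproducible) and compare the
# observed frequency of each face with the theoretical value.
rng = np.random.default_rng(seed=0)
rolls = rng.integers(1, 7, size=60_000)  # upper bound 7 is exclusive
observed = np.array([(rolls == face).mean() for face in faces])

for face, p, freq in zip(faces, probabilities, observed):
    print(f"face {face}: theoretical {p:.3f}, observed {freq:.3f}")
```

With this many rolls, each observed frequency should land close to 1/6.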

Probability distributions are crucial because:

  • They help us understand data behavior.
  • They allow us to make predictions and decisions.
  • They are used in many fields like finance, medicine, and machine learning.

Normal Distribution: part 1

The Normal Distribution (or Gaussian Distribution) is one of the most important probability distributions. Many natural phenomena approximately follow it, such as heights, IQ scores, and measurement errors. Its graph is a bell-shaped curve that is symmetrical around its mean (average) value.

The Normal Distribution is defined by two parameters:

  • Mean ($\mu$): The average of all values in the distribution.
  • Standard Deviation ($\sigma$): Measures how spread out the values are from the mean.

For example, let's say we measure the heights of adult men in a town. The mean height is 70 inches, and the standard deviation is 3 inches. Most men will be around 70 inches tall, with fewer being much shorter or taller.
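To make "most men will be around 70 inches" concrete, here is a small simulation sketch. It assumes the heights follow an exact normal distribution with the mean and standard deviation given above; the seed, the sample size, and the variable names are illustrative choices.

```python
import numpy as np

# Assumption: heights are exactly normal with mean 70 and std 3 (inches).
mean_height = 70
std_height = 3

rng = np.random.default_rng(seed=42)
heights = rng.normal(mean_height, std_height, size=100_000)

# Fraction of simulated men within one and two standard deviations of the mean.
within_one_sd = np.mean(np.abs(heights - mean_height) <= std_height)
within_two_sd = np.mean(np.abs(heights - mean_height) <= 2 * std_height)

print(f"Between 67 and 73 inches: {within_one_sd:.1%}")
print(f"Between 64 and 76 inches: {within_two_sd:.1%}")
```

For a normal distribution these fractions come out at roughly 68% and 95%, which is why heights far from 70 inches are rare.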

Generating a Normal Distribution Sample in Python

We can generate a sample of data that follows a normal distribution in Python using the numpy library. Let's create a sample with 1,000 data points, where the mean ($\mu$) is 0 and the standard deviation ($\sigma$) is 1.

Python
import numpy as np
import matplotlib.pyplot as plt

mu = 0  # mean
sigma = 1  # standard deviation
sample = np.random.normal(mu, sigma, 1000)  # generate a sample of 1000 data points

# Plot the sample (density=True scales the bars so their total area is 1)
plt.hist(sample, bins=30, density=True, alpha=0.6, color='g')
plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Histogram of Normal Distribution Sample')
plt.show()

In this example:

  1. We set the mean to 0 and the standard deviation to 1.
  2. We generated a sample of 1,000 data points.
  3. We plotted a histogram to visualize the sample.

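Running this code produces a histogram that is roughly bell-shaped and centered near 0. Because the sample is random, the exact bar heights differ from run to run.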
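Besides looking at the plot, we can check the sample numerically. The sketch below assumes a seeded generator (`np.random.default_rng`) instead of `np.random.normal`, purely so the check is reproducible; the variable names are illustrative.

```python
import numpy as np

# Assumption: a seeded generator is used only to make this check reproducible.
rng = np.random.default_rng(seed=0)

mu, sigma = 0, 1
sample = rng.normal(mu, sigma, 1000)

# The sample statistics should be close to the parameters we asked for,
# but not exactly equal, because the sample is finite.
sample_mean = sample.mean()
sample_std = sample.std(ddof=1)  # ddof=1 gives the sample standard deviation

print(f"sample mean: {sample_mean:.3f} (target {mu})")
print(f"sample std:  {sample_std:.3f} (target {sigma})")
```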

Normal Distribution: part 2

If we increase the number of data points and the number of bins (which is analogous to collecting more data), the histogram takes on a shape close to the iconic bell curve.
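One way to see this effect without a plot is to measure how far the histogram bars are from the bell curve. The sketch below is an illustration under a few assumptions: it uses `scipy.stats.norm.pdf` as the reference curve, keeps the bins fixed to isolate the effect of sample size, and `max_histogram_error` is a hypothetical helper name.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=1)

def max_histogram_error(n_points, bins=40):
    """Largest gap between histogram bar heights and the standard normal curve."""
    sample = rng.normal(0, 1, size=n_points)
    # Fixed range so both sample sizes use exactly the same bins.
    heights, edges = np.histogram(sample, bins=bins, range=(-4, 4), density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    return np.max(np.abs(heights - norm.pdf(centers, 0, 1)))

error_small = max_histogram_error(1_000)
error_large = max_histogram_error(1_000_000)

print(f"max gap with 1,000 points:     {error_small:.4f}")
print(f"max gap with 1,000,000 points: {error_large:.4f}")
```

The gap for the larger sample should be much smaller, which is the numerical counterpart of the histogram looking smoother.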

The more data you analyze, the closer the histogram gets to the curve defined as:

$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

This function is the probability density function (PDF) for the normal distribution. Let's explore it!
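The formula can be translated almost symbol for symbol into NumPy. The following is a minimal sketch; `normal_pdf` is a hypothetical helper name, and `scipy.stats.norm.pdf` is used only to confirm that the hand-written version agrees with the library.

```python
import numpy as np
from scipy.stats import norm

def normal_pdf(x, mu, sigma):
    """Normal density written directly from the formula."""
    variance = sigma ** 2
    coefficient = 1 / np.sqrt(2 * np.pi * variance)
    exponent = -((x - mu) ** 2) / (2 * variance)
    return coefficient * np.exp(exponent)

x = np.linspace(-3, 3, 7)
manual = normal_pdf(x, 0, 1)
library = norm.pdf(x, 0, 1)

print(manual)
print(np.allclose(manual, library))
```

Note that `norm.pdf` takes the standard deviation $\sigma$ as its scale argument, while the formula is written in terms of the variance $\sigma^2$.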

Probability Density Function (PDF)

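The Probability Density Function (PDF) describes the relative likelihood of a continuous random variable taking a value near a given point. The density itself is not a probability: the probability of landing in a very small interval around a point is approximately the density at that point multiplied by the width of the interval.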
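The "small interval" idea can be checked numerically. This sketch picks an arbitrary point and interval width (both are assumptions for illustration) and compares density times width against the exact probability, computed with `norm.cdf`, which the next section covers.

```python
from scipy.stats import norm

a = 0.5        # point of interest (arbitrary choice)
width = 0.01   # width of the small interval centered on a

# Approximation: density at the point times the interval width.
approx = norm.pdf(a) * width

# Exact probability of landing in [a - width/2, a + width/2].
exact = norm.cdf(a + width / 2) - norm.cdf(a - width / 2)

print(f"density x width:   {approx:.8f}")
print(f"exact probability: {exact:.8f}")
```

The two numbers agree to many decimal places, and the agreement improves as the interval shrinks.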

Here's how to calculate and visualize the PDF for our normal distribution:

Python
from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt

mu = 0
sigma = 1

# Calculate PDF for a range of values
x = np.linspace(-3*sigma, 3*sigma, 1000)
pdf = norm.pdf(x, mu, sigma)

# Plot the PDF
plt.plot(x, pdf, 'b-', lw=2)
plt.xlabel('Value')
plt.ylabel('Probability Density')
plt.title('Probability Density Function (PDF) of Normal Distribution')
plt.show()

In this example:

  1. We generated a range of values from $-3\sigma$ to $3\sigma$.
  2. We calculated the PDF for each value.
  3. We plotted the PDF curve.

The PDF shows how the probability density varies with different values. For a normal distribution, values close to the mean are more likely.

Cumulative Distribution Function (CDF)

The Cumulative Distribution Function (CDF) gives the probability that a random variable is less than or equal to a certain value. It accumulates the area under the PDF from the far left up to that value, so it rises from 0 to 1 and provides a full picture of the probabilities up to a point.
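The link between the two functions can be illustrated by adding up the PDF numerically. The sketch below assumes a simple Riemann sum over a fine grid (the grid limits and step are arbitrary choices) and compares the running total with `norm.cdf`.

```python
import numpy as np
from scipy.stats import norm

# Fine grid wide enough that almost no probability lies outside it.
x = np.linspace(-8, 8, 16_001)
dx = x[1] - x[0]

# Running sum of (density x step width) approximates the area under the PDF.
approx_cdf = np.cumsum(norm.pdf(x)) * dx
exact_cdf = norm.cdf(x)

max_gap = np.max(np.abs(approx_cdf - exact_cdf))
print(f"largest difference from norm.cdf: {max_gap:.6f}")
print(f"accumulated area over the whole grid: {approx_cdf[-1]:.4f}")
```

The accumulated total ends very close to 1, which matches the fact that all probabilities together sum to 1.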

Here's how to calculate and visualize the CDF for our normal distribution:

Python
from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt

mu = 0
sigma = 1

# Range of values from -3*sigma to 3*sigma
x = np.linspace(-3*sigma, 3*sigma, 1000)
# Calculate CDF for a range of values
cdf = norm.cdf(x, mu, sigma)

# Plot the CDF
plt.plot(x, cdf, 'r-', lw=2)
plt.xlabel('Value')
plt.ylabel('Cumulative Probability')
plt.title('Cumulative Distribution Function (CDF) of Normal Distribution')
plt.show()

In this example:

  1. We used the same range of values.
  2. We calculated the CDF for each value.
  3. We plotted the CDF curve.

Let's see how we can use the CDF to calculate probabilities.

Using CDF to Evaluate Probability: General Concept

Let's use the CDF to evaluate the probability of a random variable falling within a specific range. For our normal distribution example, we'll calculate the probability that a value is between -1 and 1.

The CDF gives us the cumulative probability that a random variable is less than or equal to a certain value. Essentially, it accumulates the probability from the left up to a given point.

For example, if we want to find the probability that our variable $X$ is less than or equal to $a$ (i.e., $P(X \leq a)$), we look at the CDF value at $a$. The CDF value at $a$ tells us the total probability of all values up to $a$.

Using CDF to Evaluate Probability: Example

Consider a normal distribution with a mean ($\mu$) of 0 and a standard deviation ($\sigma$) of 1. We want to calculate the probability that a value is between -1 and 1.

To do this, we:

  1. Find the CDF value at 1 ($P(X \leq 1)$).
  2. Find the CDF value at -1 ($P(X \leq -1)$).
  3. Subtract the CDF value at -1 from the CDF value at 1 to get the probability that the variable $X$ is between -1 and 1.

This calculation can be written as the following formula:

$$P(-1 \leq X \leq 1) = P(X \leq 1) - P(X \leq -1)$$

Reading the values off the CDF curve: at $x = 1$ the CDF is roughly 0.84, and at $x = -1$ it is roughly 0.16. So, $P(-1 \leq X \leq 1) \approx 0.84 - 0.16 = 0.68$.
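The same three steps can be carried out with `scipy.stats.norm.cdf`. This is a short sketch; the variable names are illustrative.

```python
from scipy.stats import norm

mu, sigma = 0, 1

# Step 1: cumulative probability up to 1.
cdf_upper = norm.cdf(1, mu, sigma)
# Step 2: cumulative probability up to -1.
cdf_lower = norm.cdf(-1, mu, sigma)
# Step 3: the difference is the probability of landing between -1 and 1.
probability = cdf_upper - cdf_lower

print(f"P(X <= 1)       = {cdf_upper:.4f}")    # about 0.8413
print(f"P(X <= -1)      = {cdf_lower:.4f}")    # about 0.1587
print(f"P(-1 <= X <= 1) = {probability:.4f}")  # about 0.6827
```

The approach carries over to other parameters. For the earlier height example, `norm.cdf(73, 70, 3) - norm.cdf(67, 70, 3)` gives the same probability, because 67 and 73 inches are also one standard deviation on either side of the mean.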

Lesson Summary

Great job! You've learned about probability distributions, focusing on the normal distribution. We covered generating a normal distribution sample in Python and working with the PDF and CDF.

Next, it's time to practice. In the hands-on practice, you'll generate your own distribution samples and calculate PDF and CDF values for different scenarios. This will solidify your understanding and prepare you for more complex tasks in machine learning. Let's get started!
