Welcome to our lesson on Descriptive Statistics! In this lesson, we will learn how to summarize and describe important features of a data set. By the end, you'll know how to calculate measures like the mean and standard deviation, which are crucial for understanding data in machine learning and many other fields.
Descriptive statistics help us get a quick overview of large amounts of data. Imagine you want to know the average test score of a class or how much the students' heights vary. Descriptive statistics give you the tools to answer such questions efficiently.
The mean is the average of a set of numbers. Imagine test scores: 80, 85, 90, 75, and 95. To find the mean:
- Add the scores: $80 + 85 + 90 + 75 + 95 = 425$
- Divide by the number of scores: $425 / 5 = 85$
The mean score is 85. It gives us the "central" value.
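The two steps above can be sketched in plain Python as a quick check. This is only a minimal illustration; the variable names are chosen here and are not part of the lesson.

```python
# Test scores from the example above
scores = [80, 85, 90, 75, 95]

total = sum(scores)          # step 1: add the scores
mean = total / len(scores)   # step 2: divide by the number of scores

print("Sum:", total)    # Sum: 425
print("Mean:", mean)    # Mean: 85.0
```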
The standard deviation measures how spread out numbers are. For our test scores example:
- Find the mean score: 85.
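- Subtract the mean from each score and square the result: $(80-85)^2 = 25$, $(85-85)^2 = 0$, $(90-85)^2 = 25$, $(75-85)^2 = 100$, $(95-85)^2 = 100$
- Find the average of these squared differences: $(25 + 0 + 25 + 100 + 100) / 5 = 250 / 5 = 50$
- Take the square root: $\sqrt{50} \approx 7.07$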
A standard deviation of about 7.07 tells us that the scores typically differ from the mean by roughly 7 points. A low standard deviation means the data points are close to the mean, while a high one indicates they are spread out.
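The same steps can be traced in a short Python sketch. It assumes, as the worked example does, that the squared differences are averaged over all five scores (the population form). The variable names are illustrative only.

```python
import math

# Test scores from the example above
scores = [80, 85, 90, 75, 95]

# Step 1: find the mean
mean = sum(scores) / len(scores)

# Step 2: subtract the mean from each score and square the result
squared_diffs = [(x - mean) ** 2 for x in scores]

# Step 3: average the squared differences (dividing by the number of scores)
variance = sum(squared_diffs) / len(scores)

# Step 4: take the square root
std_dev = math.sqrt(variance)

print("Squared differences:", squared_diffs)      # [25.0, 0.0, 25.0, 100.0, 100.0]
print("Variance:", variance)                      # 50.0
print("Standard deviation:", round(std_dev, 2))   # 7.07
```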
The median is another measure of central tendency. It is the middle value of a data set when the values are arranged in order. It is particularly useful when the data set contains outliers, because the median is far less sensitive to extreme values than the mean.
For example, consider the test scores: 75, 80, 85, 90, and 95. To find the median:
- Arrange the scores in order: 75, 80, 85, 90, 95
- Find the middle score: 85
If the data set contains an even number of values, the median is the average of the two middle numbers.
For instance, with scores: 75, 80, 85, and 95:
- Arrange in order: 75, 80, 85, 95
- Find the middle scores: 80 and 85
- Calculate their average: $(80 + 85) / 2 = 82.5$
The median is useful in situations where the mean might be misleading due to outliers or skewed data distributions.
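As a rough sketch of the rule described above, the helper below handles both the odd and the even case. The function name `median` and the outlier data set are hypothetical, introduced here only for illustration.

```python
def median(values):
    """Return the middle value, or the average of the two middle values
    when the number of values is even."""
    ordered = sorted(values)       # arrange the values in order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                 # odd count: a single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: average the two middle values

print(median([80, 85, 90, 75, 95]))   # 85
print(median([75, 80, 85, 95]))       # 82.5

# Hypothetical scores with one extreme value (not from the lesson)
with_outlier = [75, 80, 85, 90, 300]
print(sum(with_outlier) / len(with_outlier))   # 126.0 (mean is pulled up by the outlier)
print(median(with_outlier))                    # 85 (median is unchanged)
```

In the last two lines, the single extreme value moves the mean far from most of the scores, while the median stays at 85.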
Let's see how to calculate these in Python using the NumPy library.
Here's a code snippet to calculate the mean, standard deviation, and median for a list of data:
```python
# Calculating Mean, Standard Deviation, and Median
import numpy as np

data = [1.2, 2.3, 3.1, 4.5, 5.7]

mean = np.mean(data)
std_dev = np.std(data)
median = np.median(data)

print("Mean:", mean)
print("Standard Deviation:", std_dev)
print("Median:", median)
```

Output:

```
Mean: 3.36
Standard Deviation: 1.589465318904442
Median: 3.1
```
Note that the printed mean may differ very slightly from 3.36 in its last decimal places. This is due to floating-point rounding, not an error in the calculation.
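To illustrate the rounding effect, here is a small sketch that uses plain Python arithmetic instead of NumPy (an assumption made to keep the example dependency-free). It shows two common ways of dealing with tiny floating-point differences: rounding for display and comparing with a tolerance.

```python
import math

data = [1.2, 2.3, 3.1, 4.5, 5.7]
mean = sum(data) / len(data)

print(mean)                       # may show extra trailing digits
print(round(mean, 2))             # 3.36 (rounded for display)
print(math.isclose(mean, 3.36))   # True (equal within a small tolerance)
```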
- Import NumPy: We start by importing the `NumPy` library for numerical operations.
- Data Set: We create a list of data points: `[1.2, 2.3, 3.1, 4.5, 5.7]`.
- Calculate Mean: Using `np.mean(data)`, `NumPy` calculates the average of the data points.
- Calculate Standard Deviation: Using `np.std(data)`, `NumPy` calculates how much the data points vary from the mean. By default, this is the population standard deviation, which divides by the number of data points, as in the worked example.
- Calculate Median: Using `np.median(data)`, `NumPy` finds the middle value of the data set.
- Print Results: We print the calculated mean, standard deviation, and median.
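As a cross-check that does not depend on NumPy, the same three values can be computed with Python's standard `statistics` module. This is an alternative sketch, not the lesson's own method. `statistics.pstdev` is used because it computes the population standard deviation, which matches the default behaviour of `np.std`.

```python
import statistics

data = [1.2, 2.3, 3.1, 4.5, 5.7]

mean = statistics.mean(data)
std_dev = statistics.pstdev(data)    # population standard deviation (divides by n)
median = statistics.median(data)

print("Mean:", round(mean, 2))                     # Mean: 3.36
print("Standard Deviation:", round(std_dev, 4))    # Standard Deviation: 1.5895
print("Median:", median)                           # Median: 3.1
```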
In this lesson, we learned about descriptive statistics, focusing on the mean, standard deviation, and median. These are essential tools for summarizing and understanding data sets.
We also saw how to calculate these values using Python, making it easier to handle large data sets.
Now it's your turn to practice! In the next session, you'll be given data sets to calculate the mean and standard deviation. This will help you reinforce what you've learned and apply these concepts to real data. Let's get started!