Welcome! Today, we're going to explore SciPy, a Python library for advanced mathematical and statistical computing that builds on NumPy. A key advantage of a tool like SciPy is that it can handle complex problems requiring many calculations, which is crucial in engineering, data science, and any other discipline that relies heavily on data analysis. By the end of this lesson, you'll have met several useful SciPy features that will serve as additional tools in your data analytics toolbox.
SciPy comes pre-installed in most CodeSignal IDEs. Let's import the `stats` module, which provides numerous statistical functions:

```python
from scipy import stats
```
In statistics, distribution functions play a crucial role: they let us determine the probability of each potential outcome of a random event. For instance, in a dice game, the distribution function tells us the chance of rolling a six. Since we need some data to explore SciPy, let's first look at one way of generating a meaningful data sample. We can use the `numpy.random` module here:
```python
import numpy as np

# Simulating temperature data for a year in a city
temp_data = np.random.normal(loc=30, scale=10, size=365)
```
In this scenario, we generate an array of 365 values drawn from a normal distribution with a mean of 30 and a standard deviation of 10. Note that in `numpy.random`, `loc` stands for the mean and `scale` stands for the standard deviation.
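To connect the dice example and the `loc`/`scale` parameters to SciPy itself, here is a minimal sketch. It assumes a fair six-sided die and reuses the mean of 30 and standard deviation of 10 from above. The names `die`, `temp_dist`, and `sample`, as well as the seed value, are illustrative choices and not part of the lesson's own code.

```python
import numpy as np
from scipy import stats

# Fair six-sided die: stats.randint(low, high) is uniform on low..high-1
die = stats.randint(1, 7)
p_six = die.pmf(6)  # probability of rolling a six

# Normal model with the same parameters as the simulated temperatures;
# scipy.stats.norm also uses loc for the mean and scale for the std
temp_dist = stats.norm(loc=30, scale=10)
p_above_40 = temp_dist.sf(40)  # survival function: P(temperature > 40)

# Seeded generator (the seed is an arbitrary choice for reproducibility)
rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=30, scale=10, size=365)
share_above_40 = np.mean(sample > 40)  # fraction of simulated days above 40

print(f"P(six) = {p_six:.4f}")
print(f"P(temp > 40) = {p_above_40:.4f}")
print(f"Sample mean = {sample.mean():.2f}, sample std = {sample.std():.2f}")
print(f"Share of simulated days above 40: {share_above_40:.3f}")
```

With only 365 draws, the sample mean, standard deviation, and share of hot days should land near the theoretical values but will not match them exactly.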
SciPy offers more statistical functions than NumPy. We'll explore two: skewness and kurtosis. Skewness measures the asymmetry of a probability distribution around its mean, while kurtosis gauges how heavy its tails are, that is, how outlier-prone the distribution is. For instance, these metrics could help us understand unusual variations in a city's annual temperature data.
```python
data = np.random.normal(size=1000)

# Compute skewness - a measure of data asymmetry
data_skewness = stats.skew(data)

# Compute kurtosis - a measure of data "tailedness" or outliers
data_kurtosis = stats.kurtosis(data)

print(f"Skewness: {data_skewness}\nKurtosis: {data_kurtosis}")
```
Look at the picture below. This graph showcases asymmetry in statistical distributions. A negative skewness (blue curve) indicates that the left tail is longer or fatter than the right: the bulk of the data sits at higher values, with a stretch of unusually low values. In contrast, a positive skewness (red curve) indicates that the right tail is longer or fatter: the bulk of the data sits at lower values, with a stretch of unusually high values. Skewness helps identify the shape and direction of spread of our data.
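The sign of the skewness can be checked numerically. The sketch below uses an exponential sample as a stand-in for a right-skewed dataset and its mirror image as a left-skewed one. The choice of distribution, the seed, and the variable names are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

# Seeded generator (arbitrary seed, used only for reproducibility)
rng = np.random.default_rng(seed=0)

# Exponential data has a long right tail, so its skewness is positive
right_skewed = rng.exponential(scale=1.0, size=10_000)

# Mirroring the data around zero moves the long tail to the left
left_skewed = -right_skewed

skew_right = stats.skew(right_skewed)
skew_left = stats.skew(left_skewed)

print(f"Right-skewed sample: {skew_right:.3f}")
print(f"Left-skewed sample:  {skew_left:.3f}")
```

Mirroring a sample flips the sign of its skewness but keeps the magnitude, so the two printed values should differ only in sign.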
The next plot gives us insight into the shape of a distribution's tails and peak. The blue curve is a normal distribution, which has an excess kurtosis of 0 (the value `stats.kurtosis` reports by default), showcasing a relatively balanced distribution with few extreme values. The red curve, a Laplace distribution with higher kurtosis, has a more pronounced or 'pointy' peak with heavier tails, indicating more extreme values in the dataset. Higher kurtosis signals that exceptional events are more likely, such as a black swan event in finance.
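The contrast between the two curves can be reproduced with simulated data. This is a sketch under assumed settings: the sample size, seed, and variable names are illustrative. By default `stats.kurtosis` returns excess (Fisher) kurtosis, where a normal distribution scores 0. Passing `fisher=False` returns the Pearson definition, where a normal distribution scores 3.

```python
import numpy as np
from scipy import stats

# Seeded generator (arbitrary seed, used only for reproducibility)
rng = np.random.default_rng(seed=0)

normal_sample = rng.normal(size=100_000)
laplace_sample = rng.laplace(size=100_000)

# Excess kurtosis: about 0 for normal data; the Laplace distribution's
# theoretical excess kurtosis is 3, reflecting its heavier tails
k_normal = stats.kurtosis(normal_sample)
k_laplace = stats.kurtosis(laplace_sample)

# Pearson kurtosis is simply the excess kurtosis shifted by 3
k_normal_pearson = stats.kurtosis(normal_sample, fisher=False)

print(f"Normal  (excess):  {k_normal:.3f}")
print(f"Laplace (excess):  {k_laplace:.3f}")
print(f"Normal  (Pearson): {k_normal_pearson:.3f}")
```

Sample kurtosis is sensitive to a few extreme values, so estimates from small samples can wander far from the theoretical figures. The large sample size here keeps them reasonably stable.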
Great job! Today, we became familiar with SciPy and its application in statistical computations. We generated normally distributed sample data, computed skewness and kurtosis with the `stats` module, and learned what these two measures mean. Continue practicing to build confidence in these skills and keep exploring the possibilities in the data world with SciPy. Exciting exercises are on the way! Happy analyzing!