Machine learning! You’ve probably heard this term a lot. But what exactly is it? Think of it as teaching a computer to learn from data and make decisions or predictions based on that data. This is like teaching a child to recognize different objects by showing them examples.
In this lesson, our goal is to understand the basics of a machine learning project. We’ll generate data, visualize it, and understand the relationships within it.
Let’s start by generating some data. In real-life projects, the first step is to collect data, but we'll create synthetic (fake) data for our learning purposes using NumPy.
Why random data? It simulates different scenarios and creates a controlled environment for learning. Don't worry: by the end of this course, we will work with real data as well.
We'll use NumPy to generate areas of houses (in square feet) and their prices:
```python
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

# Generate synthetic data
num_samples = 100
area = np.random.uniform(500, 3500, num_samples)  # House area in square feet
area = np.round(area, 2)  # Round to 2 decimal places

# Assume a linear relationship: price = base_price + (area * price_per_sqft)
base_price = 50000
price_per_sqft = 200
noise = np.random.normal(0, 25000, num_samples)  # Adding some noise
price = base_price + (area * price_per_sqft) + noise
price = np.round(price, 2)  # Round to 2 decimal places

# Display a few generated data points for verification
print("Area (sq ft):", area[:5])  # Area (sq ft): [1623.62 3352.14 2695.98 2295.98 968.06]
print("Price ($):", price[:5])  # Price ($): [376900.18 712952.82 591490.02 459506.78 238120.2]
```
Real-life example: Imagine you want to predict house prices in your neighborhood. The area of a house affects its price. We simulate this by creating a simple linear relationship and adding noise to make it realistic.
Let's break down the data generation:
Generate House Areas: Creates 100 random house areas between 500 and 3500 square feet.
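Define Price Relationship: Sets a base price of $50,000 and a price of $200 per square foot, then draws random noise from a normal distribution (mean 0, standard deviation $25,000) so that prices vary around the straight line.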
Calculate Prices: Computes the final prices based on the area, base price, price per square foot, and added noise.
This method creates a realistic dataset with variable house prices based on their areas.
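To make the split between the linear part and the noise concrete, here is a minimal sketch that reuses the lesson's constants. The helper name `noiseless_price` and the `residual` variable are hypothetical, introduced only for illustration. Subtracting the noiseless price from the generated price should leave only the noise, whose spread should sit near the 25,000 standard deviation we asked for.

```python
import numpy as np

# Same constants as in the lesson
base_price = 50000
price_per_sqft = 200

def noiseless_price(area_sqft):
    """Hypothetical helper: the linear part of the price, without noise."""
    return base_price + area_sqft * price_per_sqft

np.random.seed(42)
num_samples = 100
area = np.random.uniform(500, 3500, num_samples)
noise = np.random.normal(0, 25000, num_samples)
price = noiseless_price(area) + noise

# What remains after removing the linear part is the noise itself
residual = price - noiseless_price(area)

print(noiseless_price(2000))  # 50000 + 2000 * 200 = 450000
print("Residual mean:", residual.mean())
print("Residual std:", residual.std())
```

With only 100 samples, the residual mean and standard deviation will land close to 0 and 25,000 rather than exactly on them.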
Now that we have our data, we need to handle it. This is where Pandas comes in handy. Pandas provides a powerful data structure called a DataFrame.
A DataFrame is like a table in an Excel sheet. It helps us organize data in rows and columns, making it easy to manipulate and analyze.
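As a rough illustration of what "manipulate and analyze" can look like, the sketch below builds a tiny DataFrame, filters its rows, and adds a derived column. The three rows are made-up values that follow the noiseless formula from this lesson, and the names `homes`, `large`, and `PricePerSqFt` are assumptions chosen for this example only.

```python
import pandas as pd

# Tiny illustrative table (made-up values: price = 50000 + 200 * area)
homes = pd.DataFrame({
    'Area': [1000.0, 2000.0, 3000.0],
    'Price': [250000.0, 450000.0, 650000.0],
})

# Select rows with a boolean condition, much like filtering in a spreadsheet
large = homes[homes['Area'] > 1500]

# Add a new column computed from existing columns
homes['PricePerSqFt'] = homes['Price'] / homes['Area']

print(large)
print(homes)
```

Column-wise operations like the division above apply to every row at once, which is a large part of why a DataFrame is convenient for this kind of work.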
```python
import pandas as pd

# Create DataFrame
data = pd.DataFrame({'Area': area, 'Price': price})

# Display first few rows of the dataset
print(data.head())
```
Output:
```
      Area      Price
0  1623.62  376900.18
1  3352.14  712952.82
2  2695.98  591490.02
3  2295.98  459506.78
4   968.06  238120.20
```
To understand our data better, we need to visualize it. This means creating graphs to see patterns and relationships. We use Matplotlib for this purpose.
Visualizing data is crucial because it helps us see trends, patterns, and outliers, guiding us in choosing the right algorithms and parameters.
```python
import matplotlib.pyplot as plt

# Plot the data to visualize the relationship
plt.scatter(data['Area'], data['Price'], alpha=0.5)
plt.title('House Area vs. Price')
plt.xlabel('Area (sq ft)')
plt.ylabel('Price ($)')
plt.grid()
plt.show()
```
Here is the generated scatter plot showing the relationship between house area and price, with 'House Area vs. Price' title, and labeled axes:
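A scatter plot shows the trend visually. One way to put a number on it, offered here as an optional sketch rather than part of the lesson's own workflow, is the Pearson correlation coefficient, computed with NumPy's `np.corrcoef`. The data is regenerated with the same seed and constants so the block runs on its own; the variable name `corr` is an assumption.

```python
import numpy as np

# Regenerate the lesson's data (same seed and constants, without rounding)
np.random.seed(42)
num_samples = 100
area = np.random.uniform(500, 3500, num_samples)
noise = np.random.normal(0, 25000, num_samples)
price = 50000 + area * 200 + noise

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry
# is the correlation between area and price
corr = np.corrcoef(area, price)[0, 1]
print(f"Pearson correlation between area and price: {corr:.3f}")
```

A value close to 1 matches what the plot suggests: price rises almost linearly with area, and the noise only scatters points around that line.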
Great job! Let’s recap what we learned today:
We generated synthetic data with NumPy.
We created a DataFrame with Pandas to handle and organize data.
We visualized the data with Matplotlib.
By visualizing our data, we gain insights into relationships within it. Understanding these relationships is key to building effective machine learning models.
Now it’s time for hands-on practice. You will create your own synthetic data, construct a DataFrame, and plot relationships to understand the data better. This practice will reinforce the concepts we covered and make you more comfortable with data manipulation and visualization before building your first machine learning model.
Let’s get started!