Introduction to Machine Learning

Lesson Introduction

Machine learning! You’ve probably heard this term a lot. But what exactly is it? Think of it as teaching a computer to learn from data and make decisions or predictions based on that data. This is like teaching a child to recognize different objects by showing them examples.

In this lesson, our goal is to understand the basics of a machine learning project. We’ll generate data, visualize it, and understand the relationships within it.

Data Generation

Let’s start by generating some data. In real-life projects, the first step is to collect data, but we'll create synthetic (fake) data for our learning purposes using NumPy.

Why synthetic data? Because we control how it is generated, we can simulate different scenarios and learn in a controlled environment. Don't worry: by the end of this course, we will work with real data as well.

We'll use NumPy to generate areas of houses (in square feet) and their prices:

Python
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

# Generate synthetic data
num_samples = 100
area = np.random.uniform(500, 3500, num_samples)  # House area in square feet
area = np.round(area, 2)  # Round to 2 decimal places

# Assume a linear relationship: price = base_price + (area * price_per_sqft)
base_price = 50000
price_per_sqft = 200
noise = np.random.normal(0, 25000, num_samples)  # Adding some noise
price = base_price + (area * price_per_sqft) + noise
price = np.round(price, 2)  # Round to 2 decimal places

# Display a few generated data points for verification
print("Area (sq ft):", area[:5])  # Area (sq ft): [1623.62 3352.14 2695.98 2295.98  968.06]
print("Price ($):", price[:5])  # Price ($): [376900.18 712952.82 591490.02 459506.78 238120.2]

Real-life example: Imagine you want to predict house prices in your neighborhood. The area of a house affects its price. We simulate this with a simple linear relationship and add noise to make the data more realistic.

Let's break down the data generation:

1. Generate House Areas: Creates 100 random house areas, uniformly distributed between 500 and 3500 square feet.
2. Define Price Relationship:
   - Base price: A constant starting price ($50,000 in the code).
   - Price per square foot: A fixed price per unit area ($200 in the code).
   - Noise: Normally distributed values with mean 0 and standard deviation $25,000, added to simulate real-world variability.
3. Calculate Prices: Computes the final prices from the area, base price, price per square foot, and added noise.

This method creates a realistic dataset with variable house prices based on their areas.
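
The steps above can also be wrapped in a small helper, which makes it easy to vary the sample size or the noise level. This is only a sketch, not part of the lesson's code: the function name `generate_house_data` and its parameters are hypothetical, and it uses NumPy's `default_rng` generator instead of `np.random.seed`, so the exact values differ from the output shown earlier.

```python
import numpy as np


def generate_house_data(num_samples=100, base_price=50000, price_per_sqft=200,
                        noise_std=25000, seed=42):
    """Return (area, price) arrays with price = base_price + area * rate + noise.

    Hypothetical helper for illustration; the defaults mirror the lesson's values.
    """
    rng = np.random.default_rng(seed)  # local generator, no global random state
    area = np.round(rng.uniform(500, 3500, num_samples), 2)
    noise = rng.normal(0, noise_std, num_samples)
    price = np.round(base_price + area * price_per_sqft + noise, 2)
    return area, price


# Default settings: noisy prices scattered around the straight line
area, price = generate_house_data()

# noise_std=0 removes the noise, so prices fall exactly on the line
area_clean, price_clean = generate_house_data(noise_std=0)

print("Largest deviation from the line with noise:",
      np.abs(price - (50000 + 200 * area)).max())
print("Largest deviation from the line without noise:",
      np.abs(price_clean - (50000 + 200 * area_clean)).max())
```

Comparing the two printed deviations shows how much of the spread in the prices comes from the noise term alone.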

Creating a Data Structure

Now that we have our data, we need a convenient way to handle it. This is where Pandas comes in handy. Pandas provides a powerful data structure called a DataFrame.

A DataFrame is like a table in an Excel sheet. It helps us organize data in rows and columns, making it easy to manipulate and analyze.

Python
import pandas as pd

# Create DataFrame
data = pd.DataFrame({'Area': area, 'Price': price})

# Display first few rows of the dataset
print(data.head())

Output:


      Area      Price
0  1623.62  376900.18
1  3352.14  712952.82
2  2695.98  591490.02
3  2295.98  459506.78
4   968.06  238120.20
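
To give a feel for what "easy to manipulate and analyze" means in practice, here is a short sketch of a few common DataFrame operations. It goes beyond the lesson's own code: the five rows are typed in by hand from the output above so the example runs on its own, and the column name `PricePerSqFt` is a hypothetical choice.

```python
import pandas as pd

# The five rows from the output above, entered by hand for a standalone example
data = pd.DataFrame({
    'Area': [1623.62, 3352.14, 2695.98, 2295.98, 968.06],
    'Price': [376900.18, 712952.82, 591490.02, 459506.78, 238120.20],
})

print(data.shape)       # (number of rows, number of columns)
print(data.describe())  # count, mean, std, min, quartiles, and max per column

# Select a single column (a pandas Series) and summarize it
areas = data['Area']
print("Mean area:", areas.mean())

# Keep only the rows that satisfy a condition
large_houses = data[data['Area'] > 2000]
print(large_houses)

# Add a derived column computed from existing columns
data['PricePerSqFt'] = data['Price'] / data['Area']
print(data.head())
```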

Data Visualization

To understand our data better, we need to visualize it. This means creating graphs to see patterns and relationships. We use Matplotlib for this purpose.

Visualizing data is crucial because it helps us see trends, patterns, and outliers, guiding us in choosing the right algorithms and parameters.

Python
import matplotlib.pyplot as plt

# Plot the data to visualize the relationship
plt.scatter(data['Area'], data['Price'], alpha=0.5)
plt.title('House Area vs. Price')
plt.xlabel('Area (sq ft)')
plt.ylabel('Price ($)')
plt.grid()
plt.show()

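The resulting scatter plot shows the relationship between house area and price, with the title 'House Area vs. Price' and labeled axes. The points follow a roughly straight upward trend: larger houses cost more, with some scatter caused by the added noise.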
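
Because the data is synthetic, we know the exact relationship that produced it, so we can draw that noise-free line on top of the scatter plot for comparison. The following is a sketch under a few assumptions: it recreates the lesson's data so that it runs on its own, selects the non-interactive `Agg` backend, and saves the figure to a temporary file (the file name is a hypothetical choice) instead of opening a window. With real data the underlying relationship is not known in advance.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to use plt.show()
import matplotlib.pyplot as plt
import numpy as np

# Recreate the lesson's data with the same seed and parameters
np.random.seed(42)
area = np.round(np.random.uniform(500, 3500, 100), 2)
noise = np.random.normal(0, 25000, 100)
price = np.round(50000 + area * 200 + noise, 2)

fig, ax = plt.subplots()
ax.scatter(area, price, alpha=0.5, label='Generated data')

# The noise-free relationship used to generate the prices
line_x = np.array([500, 3500])
ax.plot(line_x, 50000 + 200 * line_x, color='red',
        label='price = 50000 + 200 * area')

ax.set_title('House Area vs. Price')
ax.set_xlabel('Area (sq ft)')
ax.set_ylabel('Price ($)')
ax.grid()
ax.legend()

# Save the figure to the system's temporary directory
output_path = os.path.join(tempfile.gettempdir(), 'area_vs_price_line.png')
fig.savefig(output_path)
print("Plot saved to", output_path)
```

The points should cluster around the red line, and their vertical spread reflects the noise that was added to the prices.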

Lesson Summary

Great job! Let’s recap what we learned today:

Introduced the basic idea of machine learning.
Generated synthetic data using NumPy.
Created a DataFrame with Pandas to handle and organize data.
Visualized our data using Matplotlib.

By visualizing our data, we gain insights into relationships within it. Understanding these relationships is key to building effective machine learning models.

Now it’s time for hands-on practice. You will create your own synthetic data, construct a DataFrame, and plot relationships to understand the data better. This will reinforce the concepts we covered and make you more comfortable with data manipulation and visualization before you build your first machine learning model.

Let’s get started!
