Welcome to our exploration of Interpreting Principal Component Analysis (PCA) Results and its Application in Machine Learning. Today, we will first generate a synthetic dataset whose features are built to influence one another. Next, we will implement PCA and explore how the variables interact. We will then compare the performance of models trained on the original features against models trained on the principal components derived from PCA. Let's dive right in!
Feeding PCA-reduced data into machine learning models can improve training efficiency and reduce the risk of overfitting, because PCA lowers the dimensionality of the data while preserving most of its variance. This becomes especially useful for real-world datasets with many attributes or features.
Our first step is to create a synthetic dataset of several numeric features that influence one another. We include these dependencies so that we can later check whether PCA detects the implicit relationships among the features.
```python
import numpy as np
import pandas as pd

np.random.seed(42)  # Set random seed for reproducibility

# Number of samples
n_samples = 1000

# Generate features
tenure = np.random.normal(24, 6, n_samples).astype(int)  # Average tenure of 24 months
monthly_charges = np.random.normal(70, 12, n_samples)  # Average monthly charge of $70
data_usage = np.random.normal(20, 5, n_samples)  # Average data usage of 20 GB
monthly_calls = 100 + 2 * tenure + 0.5 * data_usage  # More calls with higher tenure and data usage
customer_satisfaction = np.random.randint(1, 11, n_samples)  # Satisfaction scores from 1 to 10

# Derived correlated features
total_charges = monthly_charges * tenure
age_of_account = tenure + np.random.normal(0, 1, n_samples)  # Very similar to tenure

# Binary target variable 'Churn' - arbitrary function influenced by different factors
churn = (tenure < 12) | (monthly_charges > 100) | (data_usage > 30) | (customer_satisfaction < 4)
```
Now, let's put our data into a pandas DataFrame:
```python
# Create DataFrame
df = pd.DataFrame({
    'Monthly Charges': monthly_charges,
    'Total Charges': total_charges,
    'Tenure': tenure,
    'Data Usage': data_usage,
    'Monthly Calls': monthly_calls,
    'Age of Account': age_of_account,
    'Customer Satisfaction': customer_satisfaction,
    'Churn': churn
})

# Take the data and target values for Logistic Regression
data = df.copy()
target = data.pop('Churn')
```
This portion of the code generates random variables to simulate typical customer usage data, including features such as monthly_charges, monthly_calls, and data_usage, along with a binary churn variable that is influenced by these features. Everything is then assembled into a single DataFrame, and the Churn column is popped off as the target.
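Before moving on, it can be helpful to confirm that the built-in dependencies are actually visible in the data. A quick, optional sanity check, not part of the original walkthrough, is to print the feature correlation matrix:

```python
# Optional sanity check: inspect pairwise correlations between the features.
# Strongly correlated pairs (e.g., Tenure vs. Age of Account, Tenure vs. Total Charges)
# are exactly the kind of redundancy PCA is designed to compress.
print(data.corr().round(2))
```

Tenure and Age of Account should show a correlation close to 1, since the latter is just tenure plus a small amount of noise.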
Before we can proceed to PCA, we need to scale our features with StandardScaler and perform a train-test split of our data.
```python
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Scale the features
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(data_scaled, target, test_size=0.2, random_state=42)
```
Data scaling is necessary for PCA because PCA is a variance-maximizing procedure: it projects the data onto the directions of greatest variance. Without scaling, features with large numeric ranges (such as Total Charges) would dominate those directions, so we standardize each feature to unit variance first.
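To see this effect for yourself, you can fit PCA on the raw and on the scaled features and compare how much variance the first component captures. This is a minimal, optional illustration rather than part of the main pipeline, and the variable names pca_raw and pca_std are ours:

```python
from sklearn.decomposition import PCA

# Illustration only: compare the first component's explained variance ratio
# on unscaled vs. standardized features.
pca_raw = PCA().fit(data)          # unscaled features
pca_std = PCA().fit(data_scaled)   # standardized features

print("First component, unscaled:", round(pca_raw.explained_variance_ratio_[0], 3))
print("First component, scaled:  ", round(pca_std.explained_variance_ratio_[0], 3))
# On unscaled data the first component is typically dominated by the feature with
# the largest numeric range (here, Total Charges), which is rarely what we want.
```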
With the data prepared, let's apply PCA and evaluate its results.
```python
from sklearn.decomposition import PCA

# Apply PCA: fit learns the principal components from the training data
pca = PCA()
pca.fit(X_train)

# Explained variance ratio
explained_variance = pca.explained_variance_ratio_
```
Here, PCA is fitted on the scaled training data. Note that fit only learns the principal components; we will transform the data into the component space later. The explained variance ratio tells us what fraction of the total variance each component captures.
We can visualize the explained variance ratio using a scree plot:
```python
import matplotlib.pyplot as plt

# Scree plot
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(explained_variance) + 1), explained_variance, marker='o', linestyle='--')
plt.title('Explained Variance Ratio by Components')
plt.xlabel('Number of Components')
plt.ylabel('Explained Variance Ratio')
plt.grid()
plt.show()
```
This code generates a scree plot of the explained variance ratio for each component. Because the components are ordered by the amount of variance they explain, the curve decreases from left to right.
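The scree plot tells us how much variance each component captures, but not which original features drive it. To connect the components back to the features, and to check whether PCA picked up the dependencies we built into the data, you can inspect the loadings in pca.components_. This is a supplementary sketch, not part of the original lesson code:

```python
# Each row of pca.components_ is a principal component expressed as weights
# (loadings) on the original, scaled features.
loadings = pd.DataFrame(
    pca.components_,
    columns=data.columns,
    index=[f'PC{i + 1}' for i in range(pca.n_components_)]
)
print(loadings.round(2))
# Correlated features such as Tenure, Age of Account, and Total Charges tend to
# load heavily on the same component, which is how PCA captures their redundancy.
```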
Now, let's compute the cumulative explained variance across an increasing number of principal components and decide how many to retain.
```python
# Cumulative explained variance
cumulative_variance = np.cumsum(explained_variance)

# Decide n_components based on a 95% threshold
n_components = np.argmax(cumulative_variance >= 0.95) + 1

print("Number of components to retain for at least 95% variance:", n_components)  # Prints 4
```
This part of the code calculates the number of principal components needed to retain at least 95% of the original data's variance. Since np.argmax returns the index of the first True value in the boolean array, adding 1 gives the smallest number of components whose cumulative ratio reaches the threshold.
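If you would also like to see this visually, a short plot of the cumulative curve with the 95% threshold marked makes the cut-off easy to read. This plot is an optional addition and does not appear in the original snippet:

```python
# Optional: visualize the cumulative explained variance and the 95% threshold
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o', linestyle='--')
plt.axhline(y=0.95, color='r', linestyle=':', label='95% threshold')
plt.title('Cumulative Explained Variance by Components')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.legend()
plt.grid()
plt.show()
```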
Finally, we will train Logistic Regression models on both the PCA-transformed data and the original features, and compare their accuracy.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Transform the train and test sets into the principal-component space
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

# Train a logistic regression model with PCA
model = LogisticRegression()
model.fit(X_train_pca, y_train)

# Predict the test set
y_pred = model.predict(X_test_pca)

# Evaluate the model with PCA
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy with PCA: {accuracy:.2f}')  # Prints 0.94
```
The accuracy of a model trained on PCA-transformed data is computed.
```python
# Train a logistic regression model without PCA
model2 = LogisticRegression()
model2.fit(X_train, y_train)
y_pred2 = model2.predict(X_test)
accuracy2 = accuracy_score(y_test, y_pred2)
print(f'Accuracy without PCA: {accuracy2:.2f}')  # Prints 0.94
```
For comparison, the accuracy of a model trained on the original features without PCA is also calculated. Notice that both models reach the same accuracy of 0.94, so projecting the data onto the principal components preserved the model's performance. Keep in mind that here we transformed with all of the components; to actually shrink the feature space, you would keep only the first n_components (four, in this case), which still retain at least 95% of the variance.
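As a final, optional experiment not shown in the original lesson, you could refit PCA with only the retained components and train on that reduced representation. Performance should stay close, though the exact accuracy may differ slightly from the full-component run, so no specific score is claimed here:

```python
# Optional sketch: train on only the top n_components principal components
pca_reduced = PCA(n_components=n_components)  # n_components == 4 from the 95% rule
X_train_reduced = pca_reduced.fit_transform(X_train)
X_test_reduced = pca_reduced.transform(X_test)

model_reduced = LogisticRegression()
model_reduced.fit(X_train_reduced, y_train)
accuracy_reduced = accuracy_score(y_test, model_reduced.predict(X_test_reduced))
print(f'Accuracy with {n_components} components: {accuracy_reduced:.2f}')
```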
We have successfully covered creating a synthetic dataset, preparing the data, implementing PCA, determining the number of principal components to retain, and comparing the accuracy of models trained with and without PCA. In the next lesson, we'll delve deeper into PCA and other dimensionality reduction techniques. Happy learning!