Welcome to our exploration of Interpreting Principal Component Analysis (PCA) Results and its Application in Machine Learning. Today, we will first generate a synthetic dataset whose features are built to influence one another. Next, we will implement PCA and explore how the variables interact. We will then compare the performance of models trained on the original features against models trained on the principal components derived from PCA. Let's dive right in!
Feeding PCA-reduced data into machine learning models can improve training efficiency and reduce the risk of overfitting, because PCA lowers the dimensionality of the data while preserving most of its variance. This becomes especially useful for real-world datasets with many attributes or features.
Our first step is to create a synthetic dataset of several numeric features that influence one another. We include these dependencies so that we can later check whether PCA detects the implicit relationships among the features.
```python
import numpy as np
import pandas as pd

np.random.seed(42)  # Set random seed for reproducibility

# Number of samples
n_samples = 1000

# Generate features
tenure = np.random.normal(24, 6, n_samples).astype(int)  # Average tenure of 24 months
monthly_charges = np.random.normal(70, 12, n_samples)  # Average monthly charge of $70
data_usage = np.random.normal(20, 5, n_samples)  # Average data usage of 20 GB
monthly_calls = 100 + 2 * tenure + 0.5 * data_usage  # More calls with higher tenure and data usage
customer_satisfaction = np.random.randint(1, 11, n_samples)  # Satisfaction scores from 1 to 10

# Derived correlated features
total_charges = monthly_charges * tenure
age_of_account = tenure + np.random.normal(0, 1, n_samples)  # Very similar to tenure

# Binary target variable 'Churn' - arbitrary function influenced by different factors
churn = (tenure < 12) | (monthly_charges > 100) | (data_usage > 30) | (customer_satisfaction < 4)
```
Now, let's put our data into a pandas DataFrame:
```python
# Create DataFrame
df = pd.DataFrame({
    'Monthly Charges': monthly_charges,
    'Total Charges': total_charges,
    'Tenure': tenure,
    'Data Usage': data_usage,
    'Monthly Calls': monthly_calls,
    'Age of Account': age_of_account,
    'Customer Satisfaction': customer_satisfaction,
    'Churn': churn
})

# Take the data and target values for Logistic Regression
data = df.copy()
target = data.pop('Churn')
```
This portion of the code generates random variables to simulate typical customer usage data, including features such as monthly_charges, monthly_calls, and data_usage, along with a binary churn variable that is influenced by these features. Everything is then assembled into a single DataFrame, and the Churn column is popped off as the target.
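Before moving on, it can be helpful to confirm that the built-in dependencies are actually visible in the data. A quick, optional sanity check, not part of the original walkthrough, is to print the feature correlation matrix:

```python
# Optional sanity check: inspect pairwise correlations between the features.
# Strongly correlated pairs (e.g., Tenure vs. Age of Account, Tenure vs. Total Charges)
# are exactly the kind of redundancy PCA is designed to compress.
print(data.corr().round(2))
```

Tenure and Age of Account should show a correlation close to 1, since the latter is just tenure plus a small amount of noise.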
Before we can proceed to PCA, we need to scale our features with StandardScaler and perform a train-test split of our data.
```python
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Scale the features
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(data_scaled, target, test_size=0.2, random_state=42)
```
Data scaling is necessary for PCA because PCA is a variance-maximizing procedure: it projects the data onto the directions of greatest variance. Without scaling, features with large numeric ranges (such as Total Charges) would dominate those directions, so we standardize each feature to unit variance first.
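To see this effect for yourself, you can fit PCA on the raw and on the scaled features and compare how much variance the first component captures. This is a minimal, optional illustration rather than part of the main pipeline, and the variable names pca_raw and pca_std are ours:

```python
from sklearn.decomposition import PCA

# Illustration only: compare the first component's explained variance ratio
# on unscaled vs. standardized features.
pca_raw = PCA().fit(data)          # unscaled features
pca_std = PCA().fit(data_scaled)   # standardized features

print("First component, unscaled:", round(pca_raw.explained_variance_ratio_[0], 3))
print("First component, scaled:  ", round(pca_std.explained_variance_ratio_[0], 3))
# On unscaled data the first component is typically dominated by the feature with
# the largest numeric range (here, Total Charges), which is rarely what we want.
```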
With the data prepared, let's apply PCA and evaluate its results.
```python
from sklearn.decomposition import PCA

# Apply PCA: fit learns the principal components from the training data
pca = PCA()
pca.fit(X_train)

# Explained variance ratio
explained_variance = pca.explained_variance_ratio_
```
Here, PCA is fitted on the scaled training data. Note that fit only learns the principal components; we will transform the data into the component space later. The explained variance ratio tells us what fraction of the total variance each component captures.
We can visualize the explained variance ratio using a scree plot:
```python
import matplotlib.pyplot as plt

# Scree plot
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(explained_variance) + 1), explained_variance, marker='o', linestyle='--')
plt.title('Explained Variance Ratio by Components')
plt.xlabel('Number of Components')
plt.ylabel('Explained Variance Ratio')
plt.grid()
plt.show()
```
This code generates a scree plot of the explained variance ratio for each component. Because the components are ordered by the amount of variance they explain, the curve decreases from left to right.
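The scree plot tells us how much variance each component captures, but not which original features drive it. To connect the components back to the features, and to check whether PCA picked up the dependencies we built into the data, you can inspect the loadings in pca.components_. This is a supplementary sketch, not part of the original lesson code:

```python
# Each row of pca.components_ is a principal component expressed as weights
# (loadings) on the original, scaled features.
loadings = pd.DataFrame(
    pca.components_,
    columns=data.columns,
    index=[f'PC{i + 1}' for i in range(pca.n_components_)]
)
print(loadings.round(2))
# Correlated features such as Tenure, Age of Account, and Total Charges tend to
# load heavily on the same component, which is how PCA captures their redundancy.
```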
Now, let's compute the cumulative explained variance across an increasing number of principal components and decide how many to retain.
```python
# Cumulative explained variance
cumulative_variance = np.cumsum(explained_variance)

# Decide n_components based on a 95% threshold
n_components = np.argmax(cumulative_variance >= 0.95) + 1

print("Number of components to retain for at least 95% variance:", n_components)  # Prints 4
```
This part of the code calculates the number of principal components needed to retain at least 95% of the original data's variance. Since np.argmax returns the index of the first True value in the boolean array, adding 1 gives the smallest number of components whose cumulative ratio reaches the threshold.
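If you would also like to see this visually, a short plot of the cumulative curve with the 95% threshold marked makes the cut-off easy to read. This plot is an optional addition and does not appear in the original snippet:

```python
# Optional: visualize the cumulative explained variance and the 95% threshold
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o', linestyle='--')
plt.axhline(y=0.95, color='r', linestyle=':', label='95% threshold')
plt.title('Cumulative Explained Variance by Components')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.legend()
plt.grid()
plt.show()
```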
Finally, we will train Logistic Regression models on both the PCA-transformed data and the original features, and compare their accuracy.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Transform the train and test sets into the principal-component space
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

# Train a logistic regression model with PCA
model = LogisticRegression()
model.fit(X_train_pca, y_train)

# Predict the test set
y_pred = model.predict(X_test_pca)

# Evaluate the model with PCA
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy with PCA: {accuracy:.2f}')  # Prints 0.94
```
The accuracy of a model trained on PCA-transformed data is computed.
```python
# Train a logistic regression model without PCA
model2 = LogisticRegression()
model2.fit(X_train, y_train)
y_pred2 = model2.predict(X_test)
accuracy2 = accuracy_score(y_test, y_pred2)
print(f'Accuracy without PCA: {accuracy2:.2f}')  # Prints 0.94
```
For comparison, the accuracy of a model trained on the original features without PCA is also calculated. Notice that both models reach the same accuracy of 0.94, so projecting the data onto the principal components preserved the model's performance. Keep in mind that here we transformed with all of the components; to actually shrink the feature space, you would keep only the first n_components (four, in this case), which still retain at least 95% of the variance.
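As a final, optional experiment not shown in the original lesson, you could refit PCA with only the retained components and train on that reduced representation. Performance should stay close, though the exact accuracy may differ slightly from the full-component run, so no specific score is claimed here:

```python
# Optional sketch: train on only the top n_components principal components
pca_reduced = PCA(n_components=n_components)  # n_components == 4 from the 95% rule
X_train_reduced = pca_reduced.fit_transform(X_train)
X_test_reduced = pca_reduced.transform(X_test)

model_reduced = LogisticRegression()
model_reduced.fit(X_train_reduced, y_train)
accuracy_reduced = accuracy_score(y_test, model_reduced.predict(X_test_reduced))
print(f'Accuracy with {n_components} components: {accuracy_reduced:.2f}')
```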
We have successfully covered creating a synthetic dataset, preparing the data, implementing PCA, determining the number of principal components to retain, and comparing the accuracy of models trained with and without PCA. In the next lesson, we'll delve deeper into PCA and other dimensionality reduction techniques. Happy learning!