Hello and welcome! Today's lesson focuses on Feature Importance in Gradient Boosting Models. We will explore how to determine which features in our dataset are most influential in predicting Tesla ($TSLA) stock prices. By understanding the importance of features, we can refine our models and make more informed trading decisions.
Before diving into feature importance, let's quickly review the previous steps to ensure we have a solid foundation.
Data Preparation and Feature Engineering:
```python
import pandas as pd
import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load TSLA dataset
tesla = datasets.load_dataset('codesignal/tsla-historic-prices')
tesla_df = pd.DataFrame(tesla['train'])

# Convert Date column to datetime type
tesla_df['Date'] = pd.to_datetime(tesla_df['Date'])

# Feature Engineering: adding technical indicators as features
tesla_df['SMA_5'] = tesla_df['Adj Close'].rolling(window=5).mean()
tesla_df['SMA_10'] = tesla_df['Adj Close'].rolling(window=10).mean()
tesla_df['EMA_5'] = tesla_df['Adj Close'].ewm(span=5, adjust=False).mean()
tesla_df['EMA_10'] = tesla_df['Adj Close'].ewm(span=10, adjust=False).mean()

# Drop NaN values created by the moving-average windows
tesla_df.dropna(inplace=True)

# Select features and target
features = tesla_df[['Open', 'High', 'Low', 'Close', 'Volume', 'SMA_5', 'SMA_10', 'EMA_5', 'EMA_10']].values
target = tesla_df['Adj Close'].values

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.25, random_state=42)

# Standardize features (fit on train only to avoid leakage)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```
Model Training:
```python
from sklearn.ensemble import GradientBoostingRegressor

# Instantiate and fit the model
model = GradientBoostingRegressor(random_state=42)
model.fit(X_train, y_train)
```
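Before reading importances off a model, it's worth a quick sanity check that it actually fits held-out data: importances from a model that fails to generalize are hard to trust. Here's a minimal sketch of such a check, using synthetic regression data from `make_regression` as a stand-in for the scaled TSLA features:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 9 scaled TSLA features
X, y = make_regression(n_samples=400, n_features=9, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)

# R^2 on the held-out set; importances are only meaningful for a model that fits
print("Test R^2:", model.score(X_test, y_test))
```

With the real TSLA data you would call `model.score(X_test, y_test)` on the split created earlier instead of generating data.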
What is Feature Importance?
Feature importance refers to techniques that assign scores to input features based on their importance in predicting the target variable. In the context of a Gradient Boosting model, feature importance indicates how valuable each feature is in constructing the boosted decision trees.
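One useful property to keep in mind: scikit-learn's impurity-based importances are non-negative and normalized to sum to 1, so each score can be read as that feature's share of the model's total splitting gain. A small sketch on synthetic data (a stand-in for our price features, not the TSLA dataset) illustrates this:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# 5 synthetic features, of which only 2 carry signal
X, y = make_regression(n_samples=300, n_features=5, n_informative=2, random_state=42)

model = GradientBoostingRegressor(random_state=42)
model.fit(X, y)

# Importances are non-negative and normalized to sum to 1
print(model.feature_importances_)
print(model.feature_importances_.sum())
```

Because the scores are shares of a fixed total, a feature's importance is always relative to the other features in that particular model.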
Why is Feature Importance Useful?
Understanding feature importance helps:
- Identify and select the most influential features, potentially simplifying the model.
- Gain insights into the factors driving your predictions.
- Improve model interpretability and trustworthiness.
Once the Gradient Boosting model is trained, we can easily access the feature importances. Let's walk through the steps:
```python
# Compute feature importance
feature_importance = model.feature_importances_

# Create a DataFrame for better visualization of feature names alongside their importance
feature_names = ['Open', 'High', 'Low', 'Close', 'Volume', 'SMA_5', 'SMA_10', 'EMA_5', 'EMA_10']
feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importance})

# Sort features by importance
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Print feature importances with names
print("Feature importance:\n", feature_importance_df)
# Output:
# Feature importance:
#    Feature    Importance
# 3    Close  9.447889e-01
# 1     High  3.668675e-02
# 0     Open  9.142875e-03
# 2      Low  8.464037e-03
# 6   SMA_10  4.800413e-04
# 7    EMA_5  2.992652e-04
# 8   EMA_10  1.326235e-04
# 5    SMA_5  5.195267e-06
# 4   Volume  3.363300e-07
```
Here's what each step is doing:
- `model.feature_importances_`: extracts the feature importance scores from the trained Gradient Boosting model.
- `feature_names = [...]`: defines a list of feature names for better readability.
- `feature_importance_df = pd.DataFrame(...)`: creates a DataFrame that links feature names with their respective importance scores.
- `feature_importance_df.sort_values(...)`: sorts the DataFrame by importance in descending order for easier interpretation.
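A caveat worth knowing: impurity-based importances are computed from the training data and can be biased toward features with many distinct values. A common cross-check is permutation importance on the held-out set, which measures how much the test score drops when one feature's values are shuffled. Here's a minimal sketch on synthetic data (a stand-in, not the TSLA dataset):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=5, n_informative=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)

# Shuffle each feature on the test set and measure the average drop in R^2
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
print(result.importances_mean)
```

If both methods agree on the ranking, as they typically do when one feature dominates, you can be more confident in the interpretation.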
Visualizing the importance of features helps interpret the results more effectively. We'll use Matplotlib to create a bar chart:
```python
import matplotlib.pyplot as plt

# Reverse the order so the most important feature appears at the top of the chart
feature_importance_df = feature_importance_df.iloc[::-1]

# Plotting feature importance
plt.figure(figsize=(10, 6))
plt.barh(feature_importance_df['Feature'], feature_importance_df['Importance'])
plt.title('Feature Importances')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()
```
The code above produces a horizontal bar chart that visually ranks the significance of each feature, making it easy to distinguish the most influential ones. This visualization is crucial for understanding how different features contribute to the model's predictions.
By examining the feature importance values and the plot, you can determine which features have the most impact on the model's predictions. In our results, the prediction of `Adj Close` relies overwhelmingly on `Close`, with `High` a distant second, so these price features are the critical factors in this model, while the moving averages and `Volume` contribute very little.
Insights and Next Steps:
- Focus on Key Features: Emphasize the most important features in further analysis and model tuning.
- Feature Selection: Consider removing less important features to simplify the model.
- Model Interpretation: Use feature importance insights to explain model predictions to stakeholders.
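The feature-selection idea above can be sketched as follows: rank features by importance, keep only the top few, and refit to check whether the smaller model holds up. This sketch uses synthetic data as a stand-in; with the TSLA data you would index into `feature_names` instead:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# 8 synthetic features, 3 of which carry signal
X, y = make_regression(n_samples=400, n_features=8, n_informative=3, noise=5.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

full_model = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)

# Keep the top 3 features by importance and refit on just those columns
top_k = np.argsort(full_model.feature_importances_)[::-1][:3]
small_model = GradientBoostingRegressor(random_state=42).fit(X_train[:, top_k], y_train)

print("full  R^2:", full_model.score(X_test, y_test))
print("top-3 R^2:", small_model.score(X_test[:, top_k], y_test))
```

If the reduced model's test score is close to the full model's, the dropped features were adding little beyond noise.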
In this lesson, you learned about the concept of feature importance in Gradient Boosting models and its practical application to predicting Tesla ($TSLA) stock prices. You computed feature importances, visualized them with a bar chart, and interpreted the results to gain actionable insights.
Understanding which features influence your model's predictions is crucial for refining your models and making informed trading decisions. Up next, practice these concepts to solidify your understanding and enhance your skillset in machine learning for financial trading.
Great job!