Submitter Details¶
Name: Ranjana Rajendran
Email: ranjana.rajendran@gmail.com
Problem Statement: Predict the tilt-erupt value at volcano eruption and the time remaining until eruption
Note: Throughout this notebook, 'pressure' and 'tilt-erupt' are used synonymously
SUMMARY OF THE STUDY¶
Result¶
For predicting tilt-erupt, two linear models gave the best results:
- Linear Regression gave an Rsquare value of 0.996039470506752.
- RidgeCV gave an Rsquare value of 0.9973216655765884.
The two linear models are robust to window sizes and overlaps, consistently scoring above 0.94 on the task of predicting the final tilt-erupt value. Both models have been analyzed using residual analysis in this study.
For predicting the time to eruption of a volcano, a linear model and several ensemble methods showed good performance:
- RidgeCV gave an Rsquare value of 0.9441159515751051.
- AdaBoostRegressor gave 0.9569821587366184.
- RandomForestRegressor gave 0.959427257586498.
- GradientBoostingRegressor gave 0.9549051234483795.
Explanation of method¶
There are two regression tasks in this project: first, predicting the tilt-erupt value at the moment a volcano erupts, and second, predicting the time remaining until eruption. The methods used to build the feature vectors for the two tasks are similar, with one key difference. The sensor observations are split into sliding windows and features are extracted from each window using tsfresh.extract_features. A useful property of tsfresh.extract_features is that it groups rows by column_id and then extracts features over all windows in each group. For predicting the final tilt-erupt, where the target value is unique per observation, this grouping can be leveraged to build a highly accurate model: because the column_id is constant across all windows of an observation, the function summarizes all windows of that observation into a single row of the feature vector. Features such as pressure_maximum and pressure_absolute_maximum ranked high in significance for this task because, as summaries over all windows of an observation, they directly imply the target value. Using this method, I found that the choice of model and its Rsquare value are robust to different window and overlap sizes, since the grouped summary stays almost the same for many features.
For predicting time to eruption, column_id is the tuple (obs_id, first timestamp of the window). This produced a reasonable number of windows that gave meaningful features, as summarized in the section on predicting time to eruption. I obtained Rsquare values in the 0.9x range for this task with a window size of 300 and an overlap below 100. With a window size of 100, the Rsquare value dropped to the 0.6x range.
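The effect of the column_id choice can be illustrated with plain pandas on toy data (the obs_id, window_start, and pressure_max values below are made up): a column_id that is constant per observation collapses to one feature row per observation, while a per-window column_id yields one row per window.

```python
import pandas as pd

# Toy per-window features: 2 observations, 3 windows each (hypothetical values)
df = pd.DataFrame({
    'obs_id': [0, 0, 0, 1, 1, 1],
    'window_start': [0, 100, 200, 0, 100, 200],
    'pressure_max': [1.2, 1.5, 2.0, 0.8, 1.1, 1.9],
})

# Task 1 (final tilt-erupt): group key constant per observation -> one summary row each
per_obs = df.groupby('obs_id')['pressure_max'].max()
print(len(per_obs))     # 2 rows, one per observation

# Task 2 (time to eruption): group key varies per window -> one row per window
per_window = df.groupby(['obs_id', 'window_start'])['pressure_max'].max()
print(len(per_window))  # 6 rows, one per window
```

tsfresh applies the same group-by logic, only with its full feature calculators instead of a single `max`.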
Choice of Metric¶
The scorer used throughout this notebook is the Rsquare value. For each model evaluated, the mean absolute error, mean squared error, and mean absolute percentage error have also been computed. Mean squared log error could not be used because the target value was often negative.
The reason for choosing the Rsquare value (excerpt from ChatGPT):
Emphasis on Deviations: Squaring the residuals gives more weight to larger deviations from the regression line. This is beneficial because it penalizes larger errors more heavily, providing a clearer picture of how well the model fits the data.
Ensuring Positivity: Squaring ensures that all residuals are positive, as the square of any real number is non-negative. This simplifies calculations and makes interpretation easier.
Differentiability: Squaring the residuals results in a smooth, differentiable function. This property is essential for many optimization algorithms used to minimize the sum of squared residuals, such as the ordinary least squares (OLS) method.
Statistical Basis: The sum of squared residuals is directly related to the variance of the dependent variable around the regression line. Minimizing this sum is equivalent to maximizing the likelihood function in the context of normal errors, which is a common assumption in linear regression.
Mathematical Convenience: Squaring the residuals leads to a straightforward mathematical formulation, making it easier to derive properties of the estimators and conduct statistical inference.
Overall, using the square of the residuals in linear regression provides a robust and statistically sound framework for fitting models to data and assessing their goodness of fit.
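As a sanity check on the choice of scorer, R-square can be computed by hand from its definition and compared against sklearn's r2_score; the same toy numbers (made up for illustration) also demonstrate why mean squared log error could not be used here: it raises a ValueError on negative targets.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_log_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])  # toy targets; note the negative value
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot
print(np.isclose(r2_manual, r2_score(y_true, y_pred)))  # True

# MSLE is undefined for negative targets, so sklearn refuses it
try:
    mean_squared_log_error(y_true, y_pred)
except ValueError as err:
    print("MSLE rejected:", err)
```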
Test & Train Split of Dataset¶
Where applicable, sklearn's train_test_split function has been used. However, for extracting features to create the feature vectors for the training and test sets, a different method has been used: the first 150 observations are used for training and the remaining 39 for testing. To run the notebook for a different window/overlap size, play the cell that shuffles the list of observations, so that a different set of observations lands in the training and test sets.
The reasons for building the training and test sets in this manner are as follows:
- To ensure no data leakage: the test set has to be completely unseen during training, and no window from the test set should appear in training.
- The test set should be unbiased, i.e., it should contain the same proportion of all volcano phases as the training set. Reserving 39 full observations for the test set ensures this.
- tsfresh.extract_features could spill data between windows while still producing distinct rows. Separating the training and test sets well in advance avoids such leakage.
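A minimal sketch of this split-before-windowing scheme, using random stand-in arrays instead of the real observations: whole observations are shuffled and partitioned first, and windows are cut only afterwards, so no window can straddle the two sets.

```python
import random
import numpy as np

rng = np.random.default_rng(42)
# Stand-ins for the 189 real observations (2 x n arrays of time and tilt-erupt)
observations = [rng.normal(size=(2, 500)) for _ in range(189)]

random.seed(0)
random.shuffle(observations)    # shuffle whole observations, never individual windows
train_obs = observations[:150]  # first 150 observations -> training
test_obs = observations[150:]   # remaining 39 -> test

def windows(obs, w=100, hop=100):
    # Cut windows of width w from a single observation
    return [obs[:, i:i + w] for i in range(0, obs.shape[1] - w + 1, hop)]

# Windowing happens per side, after the split
train_windows = [win for o in train_obs for win in windows(o)]
test_windows = [win for o in test_obs for win in windows(o)]
print(len(train_obs), len(test_obs))  # 150 39
```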
Residual Analysis¶
Thorough residual analysis has been attempted in this project. Residuals and their density are plotted against the predicted values and the predictors. Q-Q plots were made to confirm that the residuals follow a normal distribution. No model adjustments were needed, as the residual plots looked very good. The residual plots were interpreted with reference to https://www.qualtrics.com/support/stats-iq/analyses/regression-guides/interpreting-residual-plots-improve-regression/
Different window/overlap sizes¶
The notebook can be run for different window and overlap values by editing the cells that set w and the overlap size, then playing the subsequent sections without making any code changes.
Load and extract the data¶
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
#Import necessary model selection tools
from sklearn.model_selection import cross_val_score, RepeatedKFold, train_test_split, KFold
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
#Import necessary models
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.svm import LinearSVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer, StandardScaler
#Import necessary metrics
from sklearn.metrics import roc_auc_score, roc_curve, r2_score, mean_absolute_error, mean_squared_error, mean_squared_log_error, mean_absolute_percentage_error
#Simple decision tree
from sklearn.tree import DecisionTreeRegressor
# Simple bagged model
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
# Simple boosted model
from xgboost import XGBRegressor
from google.colab import drive
drive.mount('/content/drive')
path_prefix = '/content/drive/MyDrive/ML-SwitchUP/Mini Project/Volcano/'
import os
from glob import glob
pattern = path_prefix + 'Volcano_Dataset/*/*'  # avoid shadowing the builtin dir()
fnames = glob(pattern)
print(fnames)
# Read each observation as a 2 x n array in a list of arrays
arrays = [np.loadtxt(f, skiprows=14, delimiter = ',') for f in fnames]
EDA¶
Scatter plot of tilt-erupt vs time stamp¶
for x in arrays:
    plt.plot(x[0, :], x[1, :])
os.makedirs(path_prefix + 'Volcano_Dataset_Transformed', exist_ok=True)
Investigate Yeo-Johnson power transformation¶
# Transformation applied separately for each observation to avoid data leakage between observations
from sklearn.preprocessing import PowerTransformer, power_transform
new_arrays = []
path = path_prefix + 'Volcano_Dataset_Transformed/obs'
for i, x in enumerate(arrays):
    x_transformed = power_transform(x, method='yeo-johnson', standardize=False, copy=True)
    new_arrays.append(x_transformed)
    pd.DataFrame(x_transformed).to_csv(path + str(i) + '.csv', index=False, header=False)
import os
from glob import glob
dir_t = path_prefix + 'Volcano_Dataset_Transformed/*'
fnames_t = glob(dir_t)
#print(fnames_t)
# Read each observation as a 2 x n array in a list of arrays
new_arrays = [np.loadtxt(f, delimiter = ',', dtype = np.float64) for f in fnames_t]
plt.xlabel('Time')
plt.ylabel('tilt-erupt')
for x in new_arrays:
    plt.plot(x[0, :], x[1, :])
plt.show()
plt.subplot(1,2,1)
plt.xlabel('Time to eruption')
for x in new_arrays:
    sns.kdeplot(x[0, :])
plt.subplots_adjust(wspace=0.4)
plt.subplot(1,2,2)
plt.xlabel('tilt-erupt')
for x in new_arrays:
    sns.kdeplot(x[1, :])
plt.show()
This shows that most of the points are in the rest phase of the volcano and there are fewer points towards the time of eruption.
Investigate simply fitting to a polynomial curve¶
x = []
y = []
for item in new_arrays:
    x.append(item[0, :].reshape(-1, 1))
    y.append(item[1, :].reshape(-1, 1))
x = np.concatenate(x)
y = np.concatenate(y)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
# Initialize the scaler
scaler = StandardScaler()
x_train_s = np.ravel(scaler.fit_transform(x_train))
x_test_s = np.ravel(scaler.transform(x_test))
print(y_train.shape)
print(y_test.shape)
print(x_train_s.shape)
print(x_test_s.shape)
# Fit a polynomial curve
degree = 5 # Degree of the polynomial curve
coefficients = np.polyfit(x_train_s, y_train, degree) # Fitting the polynomial curve
fitted_curve = np.polyval(coefficients, x_test_s) # Evaluating the polynomial curve at given time points
# Plot the original data and the fitted curve
plt.scatter(x_train_s, y_train, label='Training Data', color = 'b')
plt.scatter(x_test_s, y_test, label='Test Data', color = 'g')
plt.scatter(x_test_s, fitted_curve, color='red', label='Fitted Curve (degree={})'.format(degree))
plt.xlabel('Time')
plt.ylabel('Tilt-Erupt')
plt.title('Fitting Polynomial Curve to Time Series Data')
plt.legend()
plt.grid(True)
plt.show()
print("r2 score of polynomial curve fitting: ", r2_score(y_test, fitted_curve))
print("Mean squared error of polynomial curve fitting: ", mean_squared_error(y_test, fitted_curve))
print("Mean absolute error of polynomial curve fitting: ", mean_absolute_error(y_test, fitted_curve))
print("Root mean squared error of polynomial curve fitting: ", np.sqrt(mean_squared_error(y_test, fitted_curve)))
Increasing or decreasing the degree of the polynomial did not considerably improve the Rsquare value.
Linear Regression with Polynomial Features¶
from sklearn.preprocessing import PolynomialFeatures
poly_features = PolynomialFeatures(degree=degree)
x_poly = poly_features.fit_transform(x_train_s.reshape(-1, 1))
# Linear regression
model = LinearRegression()
model.fit(x_poly, y_train)
x_fit_poly = poly_features.transform(x_test_s.reshape(-1, 1))
y_fit = model.predict(x_fit_poly)
# Plot the original data and the fitted curve
plt.scatter(x_train_s, y_train, label='Training Data', color = 'b')
plt.scatter(x_test_s, y_test, label='Test Data', color = 'g')
plt.scatter(x_test_s, y_fit, color='red', label='Fitted Curve (degree={})'.format(degree))
plt.xlabel('Time')
plt.ylabel('Tilt-Erupt')
plt.title('Fitting Linear Regression with Polynomial Features to Time Series Data')
plt.legend()
plt.grid(True)
plt.show()
print("r2 score of LR with Poly features: ", r2_score(y_test, y_fit))
print("Mean squared error of LR with Poly features: ", mean_squared_error(y_test, y_fit))
print("Mean absolute error of LR with Poly features: ", mean_absolute_error(y_test, y_fit))
print("Root mean squared error of LR with Poly features: ", np.sqrt(mean_squared_error(y_test, y_fit)))
# Plain linear regression (no polynomial features), for comparison
model = LinearRegression()
model.fit(x_train_s.reshape(-1, 1), y_train)
y_predicted = model.predict(x_test_s.reshape(-1, 1))
print("r2 score of plain Linear Regression: ", r2_score(y_test, y_predicted))
print("Mean squared error of plain Linear Regression: ", mean_squared_error(y_test, y_predicted))
print("Mean absolute error of plain Linear Regression: ", mean_absolute_error(y_test, y_predicted))
print("Root mean squared error of plain Linear Regression: ", np.sqrt(mean_squared_error(y_test, y_predicted)))
We see that neither linear regression nor polynomial fitting works on the data set as given above. For time series, we need to extract features from windows.
Install required libraries¶
! pip install tsfresh
!pip install lazypredict
Extract Features over a sliding window on Time Series Data¶
Windowing Technique: Each observation is divided into windows of size 'w' with an overlap of 'v', and features are extracted from each window. This allows for capturing temporal patterns within the data.
Robustness: This windowing technique is robust to different observation lengths because residual readings are appended as the last window after the penultimate window of size 'w'. This allows for flexibility in handling observations of varying lengths.
Two Tasks: There are two tasks: predicting the final 'tilt-erupt' value and predicting the 'time to eruption'.
Different Y Values: For predicting the final 'tilt-erupt' value, the 'y' value appended to each window DataFrame is the same for all windows from the same observation. However, for predicting the 'time to eruption', the 'y' value is the beginning timestamp of each window, allowing for predicting the time until eruption.
Feature Extraction: Features are extracted from the windows DataFrame using tsfresh.extract_features. This function performs a group by on 'column_id', which is defined differently for the two tasks. For predicting 'tilt-erupt', the 'column_id' is a tuple (obs_id, final tilt-erupt value), while for predicting 'time to eruption', it is a tuple (obs_id, first time stamp of window).
Data Leakage Prevention: To avoid data leakage between training and test sets, it's crucial to separate them well in advance, before windowing. This ensures that overlapping windows don't end up in both the training and test sets.
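The window layout described in the bullets above (fixed windows of size w, hop of w minus overlap, residual readings appended as a final shorter window) can be sketched in isolation on a bare numpy array; the notebook's computeFeatureVector below does the same bookkeeping with DataFrame frame indices, so this is only an illustration.

```python
import numpy as np

def sliding_windows(series, w, overlap):
    """Cut windows of size w with the given overlap; leftover samples
    become one final (shorter) residual window."""
    hop = w - overlap
    wins = [series[start:start + w] for start in range(0, len(series) - w + 1, hop)]
    covered = (len(wins) - 1) * hop + w if wins else 0
    if covered < len(series):  # residual readings -> last window
        wins.append(series[covered:])
    return wins

series = np.arange(1030)  # stand-in for one observation's readings
wins = sliding_windows(series, w=300, overlap=100)
print([len(x) for x in wins])  # [300, 300, 300, 300, 130]
```

The residual window makes the technique robust to observation lengths that are not a multiple of the hop size.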
import math
import pandas as pd
from tsfresh.feature_extraction import ComprehensiveFCParameters,extract_features
from tsfresh.utilities.dataframe_functions import impute
from sklearn.model_selection import train_test_split
def computeFeatureVector(start_pos, end_pos, train_test_spec, w, overlap, ifptime):
    df_windows = []
    for observation_num, x in enumerate(new_arrays[start_pos:end_pos], start_pos):
        df_obs = pd.DataFrame({'time': x[0, :].astype(float), 'Pressure': x[1, :], 'obs_id': observation_num})
        window_size = w
        window_hop = w - overlap
        start_frame = w
        end_frame = window_hop * math.floor(float(df_obs.shape[0]) / window_hop)
        df_list = []
        for frame_idx in range(start_frame, end_frame + 1, window_hop):
            df_list.append(df_obs.iloc[frame_idx - window_size:frame_idx, :])
        # Residual readings form the last window
        df_list.append(df_obs.iloc[end_frame:df_obs.shape[0], :])
        for df in df_list:
            if df.shape[0] == 0:  # only the residual window (last) can be empty
                break
            df['y'] = x[1, -1]  # final tilt-erupt value of the observation
            if ifptime:
                df['y'] = df['time'].iloc[0]  # first timestamp of the window
            df['obs_id'] = list(zip(df.obs_id, df.y))
        df_windows.append(pd.concat(df_list))
    df_all_windows = pd.concat(df_windows)
    print(df_all_windows.shape)
    print(df_all_windows.head())
    path_suffix = ""
    if ifptime:
        path_suffix = "ptime"
    df_all_windows.to_csv(path_prefix + 'Windows' + str(w) + ',' + str(overlap) + path_suffix + train_test_spec + '.csv', index=False)
    df_full_features = extract_features(df_all_windows.drop(['y'], axis=1, errors='ignore'), column_id='obs_id', column_sort="time", impute_function=impute, default_fc_parameters=ComprehensiveFCParameters())
    print(df_full_features.shape)
    print(df_full_features.index)
    df_full_features['y'] = [a[1] for a in df_full_features.index]
    df_full_features['obs_id'] = [a[0] for a in df_full_features.index]
    print(df_full_features.head())
    df_full_features.to_csv(path_prefix + 'WindowsComprehensiveExtractedFeatures' + str(w) + ',' + str(overlap) + path_suffix + train_test_spec + '.csv', index=False)
Prediction of tilt-erupt at eruption¶
In this section we will analyze models derived for features extracted for different window and overlap sizes for predicting the final tilt-erupt value at volcano eruption time.
The feature extraction is done by tsfresh.extract_features, which groups the dataset by column_id. In this case the column_id is the tuple (obs_id, last tilt-erupt value of the observation), which is identical for all windows from the same observation. The feature vector extracted from a training set of 150 observations hence has 150 rows, one per observation, since the windows were grouped by this column_id.
w = 100, overlap = 0¶
w = 100
ov = 0
# Shuffle the array of observations so that it is
# not always the same set of observations which are training and test
np.random.shuffle(new_arrays)
computeFeatureVector(0, 150, 'training', w, ov, False) # Observations 0 to 149 for training set
computeFeatureVector(150, 189, 'test', w, ov, False) # Observations 150 to 188 for test set
X_train_p = pd.read_csv(path_prefix + 'WindowsComprehensiveExtractedFeatures'+ str(w) + ',' + str(ov) + 'training.csv')
X_test_p = pd.read_csv(path_prefix + 'WindowsComprehensiveExtractedFeatures'+ str(w) + ',' + str(ov) +'test.csv')
pressures_train = X_train_p['y'] # Training target
pressures_test = X_test_p['y'] # Test target
X_train_p = X_train_p.drop(['y', 'obs_id'], axis = 1)
X_test_p = X_test_p.drop(['y', 'obs_id'], axis = 1)
column_names = X_train_p.columns
scaler = StandardScaler()
X_train_ps = pd.DataFrame(scaler.fit_transform(X_train_p), columns=column_names)
X_test_ps = pd.DataFrame(scaler.transform(X_test_p), columns=column_names)
print(X_train_ps.shape)
print(X_test_ps.shape)
print(pressures_train.shape)
print(pressures_test.shape)
from tsfresh import select_features
X_train_pressure = select_features(X_train_ps, pressures_train)
X_test_pressure = X_test_ps[X_train_pressure.columns]
print(X_train_pressure.shape)
print(X_test_pressure.shape)
#for x in X_train_pressure.columns:
#print(x)
LazyPredict¶
from lazypredict.Supervised import LazyRegressor
lpredict = LazyRegressor(verbose=0,ignore_warnings=True, custom_metric=None)
models,predictions = lpredict.fit(X_train_pressure, X_test_pressure, pressures_train, pressures_test)
print(models)
According to the analysis by LazyPredict, linear models like RidgeCV and Linear Regression are suggested as the best for this task. We will focus this study on Linear Regression, then Linear SVR, and then try a Decision Tree, Random Forest, and KNeighborsRegressor to see if we can infer anything useful.
Linear Models¶
Linear Regression [Good Model according to this study for predicting tilt-erupt]¶
from sklearn.linear_model import LinearRegression
model_lr = LinearRegression()
param_grid = {
    'fit_intercept': [True, False],  # Tuning fit_intercept
    'positive': [True, False],       # Tuning positive
    'copy_X': [True, False],         # affects memory use only, not fit quality
    'n_jobs': [None, -1],            # affects speed only, not the fitted model
}
# Instantiate the GridSearchCV object
grid_search_lr = GridSearchCV(model_lr, param_grid, cv=5, scoring='neg_mean_squared_error')
# Fit the model to the training data
grid_search_lr.fit(X_train_pressure, pressures_train)
# Get the best model and its parameters
best_model_lr = grid_search_lr.best_estimator_
best_params_lr = grid_search_lr.best_params_
# Make predictions on the test data
prediction_lr = best_model_lr.predict(X_test_pressure)
print("best parameters:",best_params_lr)
print("r2 score of Linear Regression model: ", r2_score(pressures_test, prediction_lr))
print("Mean squared error of Linear Regression model: ", mean_squared_error(pressures_test, prediction_lr))
print("MAE score of Linear Regression model: ", mean_absolute_error(pressures_test, prediction_lr))
#print("MSLE score of Linear Regression model: ", mean_squared_log_error(pressures_test, prediction_lr))  # target can be negative
print("MAPE score of Linear Regression model: ", mean_absolute_percentage_error(pressures_test, prediction_lr))
Evaluation of Linear regression best model¶
Residual Evaluation¶
### Calculate residual
prediction_lr_train = best_model_lr.predict(X_train_pressure)
train_residuals = pressures_train - prediction_lr_train
test_residuals = pressures_test - prediction_lr
Plot residuals, density of residuals vs predicted values¶
# Residual evaluation plots
plt.figure(figsize=(12, 6))
# Residuals vs Predictions plot
plt.subplot(2, 2, 1)
plt.scatter(prediction_lr, test_residuals, c='green', marker='s', label='Test data')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted Values')
plt.axhline(y=0, color='red', linestyle='--')
plt.legend()
plt.subplot(2, 2, 2)
plt.scatter(prediction_lr_train, train_residuals, c='blue', marker='o', label='Training data')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted Values')
plt.axhline(y=0, color='red', linestyle='--')
plt.legend()
# Residuals distribution plot
plt.subplot(2, 2, 3)
sns.histplot(test_residuals, bins = 20, color='green', label='Test residuals', kde = True)
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Distribution of Residuals')
plt.legend()
plt.subplot(2, 2, 4)
sns.histplot(train_residuals, bins = 20, color='blue', label='Training residuals', kde = True)
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Distribution of Residuals')
plt.legend()
plt.tight_layout()
plt.show()
For the training data, the residuals are normally distributed around zero across the range of predicted values. For certain window/overlap sizes, the test data appears to contain outliers; this is expected because the test data is only transformed by the scaler, not fitted. The trained model is hence capable of flagging outliers in unseen data.
QQ Plot of residuals¶
import statsmodels.api as sm
import scipy.stats as stats
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
sm.qqplot(test_residuals, stats.t, fit=True, line="45", ax=ax1)
ax1.set_xlabel('Theoretical Quantiles')
ax1.set_ylabel('Sample Quantiles')
ax1.set_title('Q-Q Plot of Test Residuals')
sm.qqplot(train_residuals, stats.t, fit=True, line="45", ax=ax2)
ax2.set_xlabel('Theoretical Quantiles')
ax2.set_ylabel('Sample Quantiles')
ax2.set_title('Q-Q Plot of Training Residuals')
plt.tight_layout()
plt.show()
The training residuals follow a normal distribution and fall along the 45-degree line. The test residuals follow the 45-degree line, but not exactly. RidgeCV, analyzed below, showed a better Q-Q plot for the test data set and has also shown good results in this analysis.
Plot residuals against the most significant feature¶
def find_most_important_feature(features, model):
    feature_names = features.columns
    coefficients = model.coef_  # use the passed-in model, not a global
    # Largest absolute coefficient = most significant (large negative coefficients count too)
    most_significant_feature_index = abs(coefficients).argmax()
    most_significant_feature = feature_names[most_significant_feature_index]
    print("Most significant feature:", most_significant_feature)
    print("Coefficient:", coefficients[most_significant_feature_index])
    return most_significant_feature
print('\n Most important feature for Linear Regression\n')
most_important_feature = find_most_important_feature(X_train_pressure,best_model_lr)
# Residual evaluation plots
plt.figure(figsize=(8, 4))
# Residuals vs Predictions plot
plt.subplot(1, 2, 1)
plt.scatter(X_test_pressure[most_important_feature], test_residuals, c='green', marker='s', label='Test data')
plt.xlabel(most_important_feature)
plt.ylabel('Residuals')
plt.title('Residuals vs most significant feature')
plt.legend()
plt.subplot(1, 2, 2)
plt.scatter(X_train_pressure[most_important_feature], train_residuals, c='blue', marker='o', label='Training data')
plt.xlabel(most_important_feature)
plt.ylabel('Residuals')
plt.title('Residuals vs most significant feature')
plt.legend()
plt.tight_layout()
plt.show()
This test is done to understand if there is any correlation between the residuals and the most significant feature. You can see that residuals are mostly concentrated around zero on X and Y axes and otherwise the residuals are randomly spread.
Test autocorrelation of residuals against time lags¶
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
sm.graphics.tsa.plot_acf(test_residuals, lags=20, ax = ax1)
ax1.set_xlabel('Lag')
ax1.set_ylabel('Autocorrelation')
ax1.set_title('Autocorrelation of Test Residuals')
sm.graphics.tsa.plot_acf(train_residuals, lags=50, ax = ax2)
ax2.set_xlabel('Lag')
ax2.set_ylabel('Autocorrelation')
ax2.set_title('Autocorrelation of Training Residuals')
plt.show()
The residuals are independent of each other across time lag.
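Beyond the ACF plots, independence can be checked numerically with the Durbin-Watson statistic (statsmodels exposes it as sm.stats.stattools.durbin_watson). A minimal hand-rolled version on synthetic residuals, as a sketch, shows the expected behaviour: values near 2 for uncorrelated residuals, well below 2 for autocorrelated ones.

```python
import numpy as np

def durbin_watson(residuals):
    # DW = sum_t (e_t - e_{t-1})^2 / sum_t e_t^2; ~2 means no lag-1 autocorrelation
    residuals = np.asarray(residuals)
    return np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

rng = np.random.default_rng(0)
white_noise = rng.normal(size=500)             # stand-in for well-behaved residuals
random_walk = np.cumsum(rng.normal(size=500))  # strongly autocorrelated series

print(round(durbin_watson(white_noise), 2))  # near 2
print(round(durbin_watson(random_walk), 2))  # much smaller than 2
```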
Plot actual vs predicted values for training and test¶
In the residuals-vs-fitted and residuals-vs-predictor plots above, we saw an X-axis imbalance for the training data set. This does not necessarily mean the model has a problem. We should plot actual vs predicted values to see whether there is much deviation from the 45-degree line.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
ax1.scatter(prediction_lr, pressures_test, label='Test data')
#ax1.axline(xy1=(0, 0), xy2 =(1,1), color='r', lw=2)
ax1.set_xlabel('Predicted')
ax1.set_ylabel('Actual')
ax1.grid(True)
ax1.legend()
ax1.set_title('predicted vs actual')
ax2.plot(prediction_lr_train, pressures_train, 'o', label='Training data')
ax2.set_xlabel('Predicted')
ax2.set_ylabel('Actual')
ax2.set_title('predicted vs actual')
ax2.grid(True)
plt.legend()
plt.tight_layout()
plt.show()
The points all align closely to a 45-degree line; if they did not, we could have tuned the model further. This confirms that the X-axis imbalance seen previously can be ignored: it simply reflects that more points have predicted values close to zero.
Plot predicted values and actual values against most significant feature¶
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
Xvalues_test = X_test_pressure[most_important_feature]
ax1.scatter(Xvalues_test, pressures_test, color='black', linewidth = 3, label = 'test')
ax1.scatter(Xvalues_test, prediction_lr, color='green', label = 'prediction')
ax1.set_xlabel(most_important_feature)
ax1.set_ylabel('tilt-erupt')
ax1.set_title('Linear Regression from test data')
ax1.legend(loc="upper left")
Xvalues_train = X_train_pressure[most_important_feature]
ax2.scatter(Xvalues_train, pressures_train, color='black', linewidth = 3, label = 'training')
ax2.scatter(Xvalues_train, prediction_lr_train, color='blue', label = 'prediction')
ax2.set_xlabel(most_important_feature)
ax2.set_ylabel('tilt-erupt')
ax2.set_title('Linear Regression from training data')
ax2.legend(loc="upper left")
plt.tight_layout()
plt.show()
Evaluate Cross-validation error for training data¶
# evaluate the model
cv = RepeatedKFold(random_state=1)
n_scores_lr = cross_val_score(best_model_lr, X_train_pressure, pressures_train, scoring='r2', cv=cv, n_jobs=-1, error_score='raise')
# report performance
print('Avg. r2: %.3f (%.3f)' % (np.mean(n_scores_lr), np.std(n_scores_lr)))
sns.histplot(n_scores_lr, kde=True, label='Best_linear_regression')  # distplot is deprecated
plt.legend()
plt.show()
Linear Support Vector Regression¶
from sklearn.svm import LinearSVR
model_lsvr = LinearSVR()
param_grid = {
    'C': [0.1, 1, 10, 100],
    'epsilon': [0.1, 0.2, 0.3, 0.4],
    'fit_intercept': [True, False],
    'loss': ['epsilon_insensitive', 'squared_epsilon_insensitive'],
    'max_iter': [1000, 2000, 3000],
    'tol': [1e-3, 1e-4, 1e-5],
    'verbose': [False],
    'random_state': [42],
    'dual': [True, False]
}
# Instantiate the GridSearchCV object
grid_search_lsvr = GridSearchCV(model_lsvr, param_grid, cv=5, scoring='r2')
# Fit the model to the training data
grid_search_lsvr.fit(X_train_pressure, pressures_train)
# Get the best model and its parameters
best_model_lsvr = grid_search_lsvr.best_estimator_
best_params_lsvr = grid_search_lsvr.best_params_
# Make predictions on the test data
prediction_lsvr = best_model_lsvr.predict(X_test_pressure)
print("best parameters:",best_params_lsvr)
print("r2 score of Linear SV Regression model: ", r2_score(pressures_test, prediction_lsvr))
print("Mean squared error of Linear SV Regression model: ", mean_squared_error(pressures_test, prediction_lsvr))
Not investigating Linear SVR further, as the test above did not produce a good r2 value.
RidgeCV [As good as LR for predicting tilt-erupt]¶
model_rcv = RidgeCV(alphas=[0.1, 0.5, 1.0, 5.0, 10.0, 100, 150, 200], cv=5, scoring = 'r2', gcv_mode = 'eigen')
# Fit the model to the training data
model_rcv.fit(X_train_pressure, pressures_train)
# Make predictions on the test data
prediction_rcv = model_rcv.predict(X_test_pressure)
print("Best alpha:", model_rcv.alpha_)
print("r2 score of RidgeCV model: ", r2_score(pressures_test, prediction_rcv))
print("Mean squared error of RidgeCV model: ", mean_squared_error(pressures_test, prediction_rcv))
print("MAE score of RidgeCV model: ", mean_absolute_error(pressures_test, prediction_rcv))
print("MAPE score of RidgeCV model: ", mean_absolute_percentage_error(pressures_test, prediction_rcv))
RidgeCV is also a good model for this purpose and produced results similar to Linear Regression; we could analyze it further as we did for LR. Here, we just plot the residuals to make sure their distribution looks good.
Evaluation of RidgeCV for predicting tilt-erupt at eruption time¶
Residual Evaluation¶
prediction_rcv_train = model_rcv.predict(X_train_pressure)
train_residuals_rcv = pressures_train - prediction_rcv_train
test_residuals_rcv = pressures_test - prediction_rcv
Plot residuals, density of residuals vs predicted values¶
# Residual evaluation plots
plt.figure(figsize=(12, 6))
# Residuals vs Predictions plot
plt.subplot(2, 2, 1)
plt.scatter(prediction_rcv, test_residuals_rcv, c='green', marker='s', label='Test data')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted Values')
plt.axhline(y=0, color='red', linestyle='--')
plt.legend()
plt.subplot(2, 2, 2)
plt.scatter(prediction_rcv_train, train_residuals_rcv, c='blue', marker='o', label='Training data')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted Values')
plt.axhline(y=0, color='red', linestyle='--')
plt.legend()
# Residuals distribution plot
plt.subplot(2, 2, 3)
sns.histplot(test_residuals_rcv, bins = 20, color='green', label='Test residuals', kde = True)
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Distribution of Residuals')
plt.legend()
plt.subplot(2, 2, 4)
sns.histplot(train_residuals_rcv, bins = 20, color='blue', label='Training residuals', kde = True)
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Distribution of Residuals')
plt.legend()
plt.tight_layout()
plt.show()
QQPlot¶
import statsmodels.api as sm
import scipy.stats as stats
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
sm.qqplot(test_residuals_rcv, stats.t, fit=True, line="45", ax=ax1)
ax1.set_xlabel('Theoretical Quantiles')
ax1.set_ylabel('Sample Quantiles')
ax1.set_title('Q-Q Plot of Test Residuals')
sm.qqplot(train_residuals_rcv, stats.t, fit=True, line="45", ax=ax2)
ax2.set_xlabel('Theoretical Quantiles')
ax2.set_ylabel('Sample Quantiles')
ax2.set_title('Q-Q Plot of Training Residuals')
plt.tight_layout()
plt.show()
Test autocorrelation of residuals against time lags¶
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
sm.graphics.tsa.plot_acf(test_residuals_rcv, lags=20, ax = ax1)
ax1.set_xlabel('Lag')
ax1.set_ylabel('Autocorrelation')
ax1.set_title('Autocorrelation of Test Residuals')
sm.graphics.tsa.plot_acf(train_residuals_rcv, lags=50, ax = ax2)
ax2.set_xlabel('Lag')
ax2.set_ylabel('Autocorrelation')
ax2.set_title('Autocorrelation of Training Residuals')
plt.show()
Cross-validation of training data¶
# evaluate the model
cv = RepeatedKFold(random_state=1)
n_scores_rcv = cross_val_score(model_rcv, X_train_pressure, pressures_train, scoring='r2', cv=cv, n_jobs=-1, error_score='raise')
# report performance
print('Mean R2: %.3f (%.3f)' % (np.mean(n_scores_rcv), np.std(n_scores_rcv)))
sns.histplot(n_scores_rcv, kde=True, label='RidgeCV Cross Validation')  # sns.distplot is deprecated
plt.legend()
plt.show()
print('\n Most important feature for RidgeCV\n')
most_important_feature = find_most_important_feature(X_train_pressure,model_rcv)
print(most_important_feature)
RidgeCV also looks like a very good model for predicting tilt-erupt value at explosion time.
Decision Tree¶
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor(random_state=1)
dt_parameters = {
'max_depth':[2,4,8,12,16,24,32],
'max_features':[None,'sqrt','log2']}  # 'auto' was removed in newer scikit-learn; None uses all features
dt_search = GridSearchCV(model,dt_parameters,cv=5,scoring='r2')
dt_search.fit(X_train_pressure, pressures_train)
#Import necessary metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
best_model_dt = dt_search.best_estimator_
best_params_dt = dt_search.best_params_
prediction_dt = best_model_dt.predict(X_test_pressure)
print("best parameters:",best_params_dt)
print("r2 score of best decision tree model: ",r2_score(pressures_test,prediction_dt))
print("Mean squared error of best decision tree model: ",mean_squared_error(pressures_test,prediction_dt))
Decision Tree is not a good model for predicting tilt-erupt at explosion time.
Random Forest¶
from sklearn.ensemble import RandomForestRegressor
model_rf = RandomForestRegressor(random_state=123)
rf_parameters = {
'max_depth':[2,4,8,12,16,20,24,32],
'n_estimators':[10, 20, 30, 40, 60, 80, 100, 150]}
rf_search = GridSearchCV(model_rf,rf_parameters,cv=5,scoring='r2')
rf_search.fit(X_train_pressure, pressures_train)
#Import necessary metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
best_model_rf = rf_search.best_estimator_
best_params_rf = rf_search.best_params_
prediction_rf = best_model_rf.predict(X_test_pressure)
print("best parameters:",best_params_rf)
print("r2 score of best random forest model: ",r2_score(pressures_test,prediction_rf))
print("Mean squared error of best random forest model: ",mean_squared_error(pressures_test,prediction_rf))
Random Forest also performs well, but since simple linear models such as Linear Regression and RidgeCV already give excellent results, there is little reason to adopt a more complex model. We therefore do not evaluate Random Forest further for this task.
KNeighborsRegressor¶
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor()
knn_parameters = {
'n_neighbors':[2,4,8,12,16,20,24,32]}
knn_search = GridSearchCV(knn,knn_parameters,cv=5,scoring='r2')
knn_search.fit(X_train_pressure, pressures_train)
#Import necessary metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
best_model_knn = knn_search.best_estimator_
best_params_knn = knn_search.best_params_
prediction_knn = best_model_knn.predict(X_test_pressure)
print("best parameters:",best_params_knn)
print("r2 score of best knn model: ",r2_score(pressures_test,prediction_knn))
print("Mean squared error of best knn model: ",mean_squared_error(pressures_test,prediction_knn))
print("MAE score of best knn model: ",mean_absolute_error(pressures_test,prediction_knn))
print("MAPE score of best knn model: ",mean_absolute_percentage_error(pressures_test,prediction_knn))
Linear Regression and RidgeCV gave better results, so we skip a detailed analysis of this model.
Plot feature importance of different models for predicting tilt-erupt¶
def plot_feature_importance(features, model, isLinear):
    print('\n Feature importance plot\n')
    features = features.columns
    # For linear models, rank by absolute coefficient magnitude:
    # raw coefficients can be negative, which would distort the sort.
    importances = np.abs(model.coef_) if isLinear else model.feature_importances_
    indices = np.argsort(importances)[::-1]
    fig, ax = plt.subplots(figsize=(4, 4))
    plt.title('Feature Importances')
    plt.barh(range(5), [importances[i] for i in indices[0:5]], color='b', align='center')
    plt.yticks(range(5), [features[i] for i in indices[0:5]])
    plt.xlabel('Relative Importance')
    plt.show()
    for i in indices[0:5]:
        print(features[i], importances[i])
    return features[indices[0]]
RidgeCV¶
most_important_feature_rcv_p = plot_feature_importance(X_train_pressure,model_rcv, True)
Linear Regression¶
most_important_feature_lr_p = plot_feature_importance(X_train_pressure,best_model_lr, True)
Random Forest¶
most_important_feature_rf_p = plot_feature_importance(X_train_pressure,best_model_rf, False)
Prediction of time to eruption¶
w= 300, overlap = 10¶
w_t = 300
ov_t = 10
# Shuffle the array of observations so that it is
# not always the same set of observations which are training and test
np.random.shuffle(new_arrays)
computeFeatureVector(0, 150, 'training', w_t, ov_t, True)
computeFeatureVector(150, 189, 'test', w_t, ov_t, True)
X_train_to = pd.read_csv( path_prefix + 'WindowsComprehensiveExtractedFeatures' + str(w_t) + ',' + str(ov_t) + 'ptimetraining.csv')
print(X_train_to.shape)
X_test_to = pd.read_csv(path_prefix + 'WindowsComprehensiveExtractedFeatures' + str(w_t) + ',' + str(ov_t) + 'ptimetest.csv')
print(X_test_to.shape)
times_train = X_train_to['y']
times_test = X_test_to['y']
X_train_t = X_train_to.drop(['y', 'obs_id'], axis = 1)
X_test_t = X_test_to.drop(['y', 'obs_id'], axis = 1)
column_names = X_train_t.columns
scaler_t = StandardScaler()
X_train_ts = pd.DataFrame(scaler_t.fit_transform(X_train_t), columns=column_names)
X_test_ts = pd.DataFrame(scaler_t.transform(X_test_t), columns=column_names)
from tsfresh import select_features
X_train_times = select_features(X_train_ts, times_train)
X_test_times = X_test_ts[X_train_times.columns]
print(X_train_times.shape)
print(X_test_times.shape)
for x in X_train_times.columns:
print(x)
from lazypredict.Supervised import LazyRegressor
lpredict = LazyRegressor(verbose=0,ignore_warnings=True, custom_metric=None)
models,predictions = lpredict.fit(X_train_times, X_test_times, times_train, times_test)
print(models)
Among the models evaluated by LazyPredict, let us first consider linear models and then move on to ensembles. Although the ensembles appear to give the best results, we start with the simpler models. Among linear models, RidgeCV achieved an R2 of 0.93 in as little as 0.25 seconds. Among the ensembles, AdaBoostRegressor took the least time (2.04 seconds) for an R2 similar to that of GradientBoostingRegressor. We evaluate RidgeCV first, then the ensemble models.
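LazyPredict returns its `models` summary as a pandas DataFrame, so the "accuracy first, speed as tie-breaker" selection described above can be done programmatically. The sketch below uses a stand-in DataFrame with made-up scores purely for illustration; in the notebook one would sort the actual `models` object:

```python
import pandas as pd

# Illustrative stand-in for the `models` table returned by LazyPredict
# (these values are invented for demonstration, not the study's scores).
models = pd.DataFrame(
    {
        "R-Squared": [0.93, 0.95, 0.95, 0.94],
        "Time Taken": [0.25, 2.04, 3.10, 5.20],
    },
    index=[
        "RidgeCV",
        "AdaBoostRegressor",
        "GradientBoostingRegressor",
        "RandomForestRegressor",
    ],
)

# Rank candidates by accuracy first, then prefer faster models on ties
ranked = models.sort_values(["R-Squared", "Time Taken"], ascending=[False, True])
print(ranked.index[0])  # → AdaBoostRegressor
```

With ties on R-Squared, the faster model wins, matching the reasoning used to pick AdaBoostRegressor among the ensembles.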
RidgeCV¶
rcv_parameters = {
'alpha_per_target': [True, False],
'fit_intercept':[True, False],
'gcv_mode':['auto', 'svd','eigen']
}
# Leave cv=None on the inner RidgeCV: gcv_mode and alpha_per_target only apply
# to the leave-one-out (GCV) mode, and the outer GridSearchCV already does 5-fold CV.
model_rcv_t = RidgeCV(alphas = [0.1, 0.5, 1.0, 5.0, 10.0, 100, 150, 200], scoring = 'r2')
model_rcv_t = GridSearchCV(model_rcv_t,rcv_parameters,cv=5,scoring='r2')
model_rcv_t.fit(X_train_times, times_train)
# Make predictions on the test data
prediction_rcv_t = model_rcv_t.predict(X_test_times)
best_model_rcv_t = model_rcv_t.best_estimator_
print("Best alpha:", best_model_rcv_t.alpha_)
print("best parameters:",model_rcv_t.best_params_)
print("r2 score of RidgeCV model: ",r2_score(times_test,prediction_rcv_t))
print("Mean squared error of RidgeCV model: ",mean_squared_error(times_test,prediction_rcv_t))
print("MAE score of RidgeCV model: ",mean_absolute_error(times_test,prediction_rcv_t))
print("MAPE score of RidgeCV model: ",mean_absolute_percentage_error(times_test,prediction_rcv_t))
Random Forest¶
from sklearn.ensemble import RandomForestRegressor
model_rf_time = RandomForestRegressor(random_state=123)
rf_parameters = {
'max_depth':[2,4,8,12,16,20,24,32],
'n_estimators':[10, 20, 30]}
rf_search_time = GridSearchCV(model_rf_time,rf_parameters,cv=5,scoring='r2')
rf_search_time.fit(X_train_times, times_train)
#Import necessary metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
best_model_rf_time = rf_search_time.best_estimator_
best_params_rf_time = rf_search_time.best_params_
prediction_rf_t = best_model_rf_time.predict(X_test_times)
print("best parameters:",best_params_rf_time)
print("r2 score of best random forest model: ",r2_score(times_test,prediction_rf_t))
print("Mean squared error of best random forest model: ",mean_squared_error(times_test,prediction_rf_t))
print("MAE score of RF model: ",mean_absolute_error(times_test,prediction_rf_t))
print("MAPE score of RF model: ",mean_absolute_percentage_error(times_test,prediction_rf_t))
GradientBoostingRegressor¶
from sklearn.ensemble import GradientBoostingRegressor
# Define the model
model_gbr = GradientBoostingRegressor()
# Define the grid of hyperparameters to search
param_grid_gbr = {
'n_estimators': [2, 4, 10, 20, 30],
'learning_rate': [0.01, 0.1, 0.5],
'max_depth': [3, 5, 7]
}
# Perform grid search with cross-validation
grid_search_gbr = GridSearchCV(model_gbr, param_grid_gbr, cv=5, scoring='r2')
grid_search_gbr.fit(X_train_times, times_train)
# Get the best model and hyperparameters
best_model_gbr = grid_search_gbr.best_estimator_
best_params_gbr = grid_search_gbr.best_params_
# Print the best hyperparameters found
print("Best Hyperparameters:", best_params_gbr)
# Make predictions on the test set
prediction_gbr = best_model_gbr.predict(X_test_times)
# Print the R-squared score
print("R-squared score:", r2_score(times_test, prediction_gbr))
print("Mean squared error of best GBR model: ",mean_squared_error(times_test,prediction_gbr))
print("MAE score of GBR model: ",mean_absolute_error(times_test,prediction_gbr))
print("MAPE score of GBR model: ",mean_absolute_percentage_error(times_test,prediction_gbr))
AdaBoostRegressor¶
from sklearn.ensemble import AdaBoostRegressor
# Define the model
model_abr = AdaBoostRegressor()
# Define the grid of hyperparameters to search
param_grid_abr = {
'n_estimators': [5, 10, 15, 30, 40],
'learning_rate': [0.01, 0.1, 0.5],
}
# Perform grid search with cross-validation
grid_search_abr = GridSearchCV(model_abr, param_grid_abr, cv=5, scoring='r2')
grid_search_abr.fit(X_train_times, times_train)
# Get the best model and hyperparameters
best_model_abr = grid_search_abr.best_estimator_
best_params_abr = grid_search_abr.best_params_
# Print the best hyperparameters found
print("Best Hyperparameters:", best_params_abr)
# Make predictions on the test set
prediction_abr = best_model_abr.predict(X_test_times)
# Print the R-squared score
print("R-squared score:", r2_score(times_test, prediction_abr))
print("Mean squared error of best ABR model: ",mean_squared_error(times_test,prediction_abr))
print("MAE score of ABR model: ",mean_absolute_error(times_test,prediction_abr))
print("MAPE score of ABR model: ",mean_absolute_percentage_error(times_test,prediction_abr))
The following text is provided by ChatGPT.
Choosing between RidgeCV and an ensemble model like AdaBoostRegressor or GradientBoostingRegressor depends on various factors such as interpretability, computational efficiency, and model complexity. Here are some considerations:
Interpretability: RidgeCV provides coefficients for each feature, allowing for easier interpretation of the model's behavior. On the other hand, ensemble models like AdaBoostRegressor or GradientBoostingRegressor are generally more complex, making it harder to interpret individual feature contributions.
Computational Efficiency: RidgeCV is generally faster to train compared to ensemble models, especially when dealing with large datasets. Ensemble models often require training multiple weak learners or base estimators, which can be computationally expensive.
Model Complexity: RidgeCV typically produces a simpler model compared to ensemble models. If a simpler model is preferred due to concerns about overfitting or interpretability, RidgeCV might be a better choice.
Robustness: Ensemble models can often handle a wider range of data distributions and relationships compared to RidgeCV. They are also less sensitive to outliers and noise in the data.
Generalization Performance: While both RidgeCV and ensemble models can achieve similar R-square values on training data, their performance on unseen data might differ. It's important to evaluate the models on validation or test data to assess their generalization performance. In conclusion, if interpretability, computational efficiency, and model simplicity are important considerations, RidgeCV might be preferred. However, if robustness and potential for higher predictive performance are more important, ensemble models like AdaBoostRegressor or GradientBoostingRegressor might be preferred, even if they show similar R-square values on training data.
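The computational-efficiency point above can be made concrete with a quick timing comparison. This sketch uses synthetic data from `make_regression` rather than the volcano features, so the numbers are only indicative of the general trade-off:

```python
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the volcano features) keeps the sketch self-contained
X, y = make_regression(n_samples=500, n_features=20, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

results = {}
for model in (RidgeCV(alphas=[0.1, 1.0, 10.0]), GradientBoostingRegressor(random_state=0)):
    t0 = time.perf_counter()
    model.fit(X_tr, y_tr)
    results[type(model).__name__] = (model.score(X_te, y_te), time.perf_counter() - t0)

for name, (r2, secs) in results.items():
    print(f"{name}: R2={r2:.3f}, fit time={secs:.3f}s")
```

On linear data like this, RidgeCV fits almost instantly and scores near 1.0, while the boosted ensemble must train many base trees, illustrating the efficiency argument.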
RidgeCV gave very good values for all metrics. As it is a simple model, let us evaluate it in detail first.
Evaluation of RidgeCV model for predicting time to eruption¶
Residual Evaluation¶
prediction_rcv_train_t = best_model_rcv_t.predict(X_train_times)
train_residuals_rcv_t = times_train - prediction_rcv_train_t
test_residuals_rcv_t = times_test - prediction_rcv_t
Plot residual and density against predicted for training and test¶
# Residual evaluation plots
plt.figure(figsize=(12, 6))
# Residuals vs Predictions plot
plt.subplot(2, 2, 1)
plt.scatter(times_test, test_residuals_rcv_t, c='green', marker='s', label='Test data')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted Values')
plt.axhline(y=0, color='red', linestyle='--')
plt.legend()
plt.subplot(2, 2, 2)
plt.scatter(times_train, train_residuals_rcv_t, c='blue', marker='o', label='Training data')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted Values')
plt.axhline(y=0, color='red', linestyle='--')
plt.legend()
# Residuals distribution plot
plt.subplot(2, 2, 3)
sns.histplot(test_residuals_rcv_t, bins = 20, color='green', label='Test residuals', kde = True)
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Distribution of Residuals')
plt.legend()
plt.subplot(2, 2, 4)
sns.histplot(train_residuals_rcv_t, bins = 20, color='blue', label='Training residuals', kde = True)
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Distribution of Residuals')
plt.legend()
plt.tight_layout()
plt.show()
QQ Plot of residuals¶
import statsmodels.api as sm
import scipy.stats as stats
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
sm.qqplot(test_residuals_rcv_t, stats.t, fit=True, line="45", ax=ax1)
ax1.set_xlabel('Theoretical Quantiles')
ax1.set_ylabel('Sample Quantiles')
ax1.set_title('Q-Q Plot of Test Residuals')
sm.qqplot(train_residuals_rcv_t, stats.t, fit=True, line="45", ax=ax2)
ax2.set_xlabel('Theoretical Quantiles')
ax2.set_ylabel('Sample Quantiles')
ax2.set_title('Q-Q Plot of Training Residuals')
plt.tight_layout()
plt.show()
For both test and training data, the residuals follow the 45-degree line closely, which indicates a good model and suggests that the residuals are approximately normally distributed.
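A formal normality test complements the visual Q-Q check. The sketch below applies the Shapiro-Wilk test to hypothetical residuals; in the notebook one would pass `test_residuals_rcv_t` or `train_residuals_rcv_t` instead:

```python
import numpy as np
from scipy import stats

# Hypothetical residuals for illustration; substitute the model's residual series.
rng = np.random.default_rng(42)
residuals = rng.normal(loc=0.0, scale=1.0, size=200)

stat, p_value = stats.shapiro(residuals)
# A large p-value means we cannot reject the hypothesis that residuals are normal
print(f"Shapiro-Wilk statistic={stat:.3f}, p-value={p_value:.3f}")
```

A statistic close to 1 together with a non-small p-value supports the normality reading of the Q-Q plot.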
Analyse autocorrelation of residuals¶
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
sm.graphics.tsa.plot_acf(test_residuals_rcv_t, lags=100, ax = ax1)
ax1.set_xlabel('Lag')
ax1.set_ylabel('Autocorrelation')
ax1.set_title('Autocorrelation of Test Residuals')
sm.graphics.tsa.plot_acf(train_residuals_rcv_t, lags=100, ax = ax2)
ax2.set_xlabel('Lag')
ax2.set_ylabel('Autocorrelation')
ax2.set_title('Autocorrelation of Training Residuals')
plt.show()
We see no significant autocorrelation of the residuals at any lag.
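The visual ACF check can be backed by the Durbin-Watson statistic, which is near 2 when residuals have no first-order autocorrelation. A minimal sketch on hypothetical uncorrelated residuals (the notebook would pass `test_residuals_rcv_t` instead):

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

# Hypothetical uncorrelated residuals; substitute the model's residual series.
rng = np.random.default_rng(0)
residuals = rng.normal(size=200)

dw = durbin_watson(residuals)
# Values near 2 indicate no first-order autocorrelation;
# values toward 0 or 4 indicate positive or negative autocorrelation.
print(f"Durbin-Watson statistic: {dw:.2f}")
```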
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
ax1.scatter(prediction_rcv_t, times_test, label='Test data')
#ax1.axline(xy1=(0, 0), xy2 =(1,1), color='r', lw=2)
ax1.set_xlabel('Predicted')
ax1.set_ylabel('Actual')
ax1.grid(True)
ax1.legend()
ax1.set_title('predicted vs actual')
ax2.plot(prediction_rcv_train_t, times_train, 'o', label='Training data')
ax2.set_xlabel('Predicted')
ax2.set_ylabel('Actual')
ax2.set_title('predicted vs actual')
ax2.grid(True)
plt.legend()
plt.tight_layout()
plt.show()
The slight x-axis imbalance in the residuals vs predicted plot is explained by this plot: the predicted and actual values follow each other closely.
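The commented-out `axline` call above suggests a reference line was intended; drawing the identity line makes the "predicted follows actual" claim visually explicit. A minimal sketch with made-up data (the notebook would pass `prediction_rcv_t` and `times_test`):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs outside a notebook
import matplotlib.pyplot as plt

# Hypothetical predictions for illustration only
rng = np.random.default_rng(1)
actual = rng.uniform(0, 100, size=50)
predicted = actual + rng.normal(scale=3.0, size=50)

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(predicted, actual, label="Test data")
# Identity line: points on it are perfectly predicted
lims = [min(predicted.min(), actual.min()), max(predicted.max(), actual.max())]
ax.plot(lims, lims, "r--", label="Identity line")
ax.set_xlabel("Predicted")
ax.set_ylabel("Actual")
ax.legend()
```

The closer the scatter hugs the dashed line, the smaller the residuals at that prediction level.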
Cross-validation check on training data¶
# evaluate the model
cv = RepeatedKFold(random_state=1)
n_scores_rcv_t = cross_val_score(best_model_rcv_t, X_train_times, times_train, scoring='r2', cv=cv, n_jobs=-1, error_score='raise')
# report performance
print('Mean R2: %.3f (%.3f)' % (np.mean(n_scores_rcv_t), np.std(n_scores_rcv_t)))
sns.histplot(n_scores_rcv_t, kde=True)  # sns.distplot is deprecated
Evaluation of AdaBoostRegressor for predicting time to eruption¶
Residual Evaluation¶
prediction_abr_t_train = best_model_abr.predict(X_train_times)
train_residuals_abr_t = times_train - prediction_abr_t_train
test_residuals_abr_t = times_test - prediction_abr
Plot residual and density against predicted for training and test¶
# Residual evaluation plots
plt.figure(figsize=(12, 6))
# Residuals vs Predictions plot
plt.subplot(2, 2, 1)
plt.scatter(times_test, test_residuals_abr_t, c='green', marker='s', label='Test data')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted Values')
plt.axhline(y=0, color='red', linestyle='--')
plt.legend()
plt.subplot(2, 2, 2)
plt.scatter(times_train, train_residuals_abr_t, c='blue', marker='o', label='Training data')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted Values')
plt.axhline(y=0, color='red', linestyle='--')
plt.legend()
# Residuals distribution plot
plt.subplot(2, 2, 3)
sns.histplot(test_residuals_abr_t, bins = 20, color='green', label='Test residuals', kde = True)
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Distribution of Residuals')
plt.legend()
plt.subplot(2, 2, 4)
sns.histplot(train_residuals_abr_t, bins = 20, color='blue', label='Training residuals', kde = True)
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Distribution of Residuals')
plt.legend()
plt.tight_layout()
plt.show()
QQ Plot¶
import statsmodels.api as sm
import scipy.stats as stats
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
sm.qqplot(test_residuals_abr_t, stats.t, fit=True, line="45", ax=ax1)
ax1.set_xlabel('Theoretical Quantiles')
ax1.set_ylabel('Sample Quantiles')
ax1.set_title('Q-Q Plot of Test Residuals')
sm.qqplot(train_residuals_abr_t, stats.t, fit=True, line="45", ax=ax2)
ax2.set_xlabel('Theoretical Quantiles')
ax2.set_ylabel('Sample Quantiles')
ax2.set_title('Q-Q Plot of Training Residuals')
plt.tight_layout()
plt.show()
Analyse autocorrelation of residuals¶
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
sm.graphics.tsa.plot_acf(test_residuals_abr_t, lags=100, ax = ax1)
ax1.set_xlabel('Lag')
ax1.set_ylabel('Autocorrelation')
ax1.set_title('Autocorrelation of Test Residuals')
sm.graphics.tsa.plot_acf(train_residuals_abr_t, lags=100, ax = ax2)
ax2.set_xlabel('Lag')
ax2.set_ylabel('Autocorrelation')
ax2.set_title('Autocorrelation of Training Residuals')
plt.show()
Plot actual vs predicted¶
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
ax1.scatter(prediction_abr, times_test, label='Test data')
#ax1.axline(xy1=(0, 0), xy2 =(1,1), color='r', lw=2)
ax1.set_xlabel('Predicted')
ax1.set_ylabel('Actual')
ax1.grid(True)
ax1.legend()
ax1.set_title('predicted vs actual')
ax2.plot(prediction_abr_t_train, times_train, 'o', label='Training data')
ax2.set_xlabel('Predicted')
ax2.set_ylabel('Actual')
ax2.set_title('predicted vs actual')
ax2.grid(True)
plt.legend()
plt.tight_layout()
plt.show()
This explains the x-axis imbalance in the residuals vs predicted plot above.
Cross validation against training data¶
# evaluate the model
cv = RepeatedKFold(random_state=1)
n_scores_abr_t = cross_val_score(best_model_abr, X_train_times, times_train, scoring='r2', cv=cv, n_jobs=-1, error_score='raise')
# report performance
print('Mean R2: %.3f (%.3f)' % (np.mean(n_scores_abr_t), np.std(n_scores_abr_t)))
sns.histplot(n_scores_abr_t, kde=True)  # sns.distplot is deprecated
Evaluation of Random Forest Regressor for predicting time to eruption¶
prediction_rf_train_t = best_model_rf_time.predict(X_train_times)
train_residuals_rf_t = times_train - prediction_rf_train_t
test_residuals_rf_t = times_test - prediction_rf_t
Residual Analysis¶
Plot residual and density against predicted for training and test¶
# Residual evaluation plots
plt.figure(figsize=(12, 6))
# Residuals vs Predictions plot
plt.subplot(2, 2, 1)
plt.scatter(times_test, test_residuals_rf_t, c='green', marker='s', label='Test data')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted Values')
plt.axhline(y=0, color='red', linestyle='--')
plt.legend()
plt.subplot(2, 2, 2)
plt.scatter(times_train, train_residuals_rf_t, c='blue', marker='o', label='Training data')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted Values')
plt.axhline(y=0, color='red', linestyle='--')
plt.legend()
# Residuals distribution plot
plt.subplot(2, 2, 3)
sns.histplot(test_residuals_rf_t, bins = 20, color='green', label='Test residuals', kde = True)
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Distribution of Residuals')
plt.legend()
plt.subplot(2, 2, 4)
sns.histplot(train_residuals_rf_t, bins = 20, color='blue', label='Training residuals', kde = True)
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Distribution of Residuals')
plt.legend()
plt.tight_layout()
plt.show()
QQ Plot¶
import statsmodels.api as sm
import scipy.stats as stats
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
sm.qqplot(test_residuals_rf_t, stats.t, fit=True, line="45", ax=ax1)
ax1.set_xlabel('Theoretical Quantiles')
ax1.set_ylabel('Sample Quantiles')
ax1.set_title('Q-Q Plot of Test Residuals')
sm.qqplot(train_residuals_rf_t, stats.t, fit=True, line="45", ax=ax2)
ax2.set_xlabel('Theoretical Quantiles')
ax2.set_ylabel('Sample Quantiles')
ax2.set_title('Q-Q Plot of Training Residuals')
plt.tight_layout()
plt.show()
Analysis of Autocorrelation of residuals¶
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
sm.graphics.tsa.plot_acf(test_residuals_rf_t, lags=100, ax = ax1)
ax1.set_xlabel('Lag')
ax1.set_ylabel('Autocorrelation')
ax1.set_title('Autocorrelation of Test Residuals')
sm.graphics.tsa.plot_acf(train_residuals_rf_t, lags=100, ax = ax2)
ax2.set_xlabel('Lag')
ax2.set_ylabel('Autocorrelation')
ax2.set_title('Autocorrelation of Training Residuals')
plt.show()
Plot actual vs predicted¶
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
ax1.scatter(prediction_rf_t, times_test, label='Test data')
#ax1.axline(xy1=(0, 0), xy2 =(1,1), color='r', lw=2)
ax1.set_xlabel('Predicted')
ax1.set_ylabel('Actual')
ax1.grid(True)
ax1.legend()
ax1.set_title('predicted vs actual')
ax2.plot(prediction_rf_train_t, times_train, 'o', label='Training data')
ax2.set_xlabel('Predicted')
ax2.set_ylabel('Actual')
ax2.set_title('predicted vs actual')
ax2.grid(True)
plt.legend()
plt.tight_layout()
plt.show()
Cross validation on training data¶
# evaluate the model
cv = RepeatedKFold(random_state=1)
n_scores = cross_val_score(best_model_rf_time, X_train_times, times_train, scoring='r2', cv=cv, n_jobs=-1, error_score='raise')
# report performance
print('Mean R2: %.3f (%.3f)' % (np.mean(n_scores), np.std(n_scores)))
sns.histplot(n_scores, kde=True)  # sns.distplot is deprecated
Plot Feature importance of different models for predicting time to eruption¶
Feature importance of RidgeCV¶
most_important_feature_rcv_time = plot_feature_importance(X_train_times,best_model_rcv_t, True)
plt.scatter(X_test_times[most_important_feature_rcv_time], times_test, color='black')
plt.scatter(X_test_times[most_important_feature_rcv_time], prediction_rcv_t, color='blue', linewidth=3)
plt.ylabel(('time to eruption'))
plt.xlabel((most_important_feature_rcv_time))
plt.show()
AdaBoostRegressor¶
most_important_feature_abr_time = plot_feature_importance(X_train_times,best_model_abr, False)
plt.scatter(X_test_times[most_important_feature_abr_time], times_test, color='black')
plt.scatter(X_test_times[most_important_feature_abr_time], prediction_abr, color='blue', linewidth=3)
plt.ylabel(('time to eruption'))
plt.xlabel((most_important_feature_abr_time))
plt.show()
GradientBoostingRegressor¶
#Feature importance of GradientBoostingRegressor
most_important_feature_gbr_time = plot_feature_importance(X_train_times,best_model_gbr, False)
plt.scatter(X_test_times[most_important_feature_gbr_time], times_test, color='black')
plt.scatter(X_test_times[most_important_feature_gbr_time], prediction_gbr, color='blue', linewidth=3)
plt.ylabel(('time to eruption'))
plt.xlabel((most_important_feature_gbr_time))
plt.show()
Feature importance of Random Forest¶
most_important_feature_rf_time = plot_feature_importance(X_train_times,best_model_rf_time, False)
plt.scatter(X_test_times[most_important_feature_rf_time], times_test, color='black')
plt.scatter(X_test_times[most_important_feature_rf_time], prediction_rf_t, color='blue', linewidth=3)
plt.ylabel(('time to eruption'))
plt.xlabel((most_important_feature_rf_time))
plt.show()
For all the models considered for predicting time to eruption, the same two features rank as most significant:
- Pressure__range_count__max_1__min_-1
- Pressure__length
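A quick sanity check on such a ranking is to correlate candidate features with the target directly. The sketch below does this on an invented miniature table whose column names merely mimic the tsfresh naming seen above; the values are not from the study:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature feature table for illustration only
rng = np.random.default_rng(7)
n = 100
length = rng.integers(50, 500, size=n).astype(float)
target = 2.0 * length + rng.normal(scale=10.0, size=n)  # target driven mostly by length
X = pd.DataFrame(
    {
        "pressure__length": length,
        "pressure__range_count__max_1__min_-1": 0.5 * length + rng.normal(scale=5.0, size=n),
        "pressure__mean": rng.normal(size=n),  # unrelated feature
    }
)

# Rank features by absolute Pearson correlation with the target
corr = X.apply(lambda col: np.corrcoef(col, target)[0, 1]).abs().sort_values(ascending=False)
print(corr.index[0])
```

Features that dominate model-based importance plots should also show strong direct correlation with the target; a large gap between the two views would warrant a closer look.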