Environment: RStudio
Make sure you have R and RStudio installed on your machine. If not, you can follow the links below for the same.
Install R
Install R on your machine
For windows: https://cran.r-project.org/bin/windows/base/
Install RStudio
Install RStudio on your machine.
We'll make predictive model using "mtcars" dataset which is one of the built-in dataset in RStudio. This dataset has 11 variables namely "mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear" and "carb" for 32 car models.
Load this dataset in variable myData
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
myData=mtcars |
Let's make a model to predict "mpg" using “qsec”, “hp” and “wt” attributes. We'll use "caret" package to make models. Install and load the caret package
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
install.packages("caret") | |
library(caret) |
Separate the chunk of data we need
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
final_data = myData [ , c(“wt”,”mpg”,“qsec”, “hp”)] |
Replace all the blank cells with NA and then remove all the rows with any cell as "NA"
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
final_data[final_data ==""]=NA | |
final_data = final_data[complete.cases(final_data), ] |
Define the train control for training the model using
Repeated K-fold Cross Validation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
train_control=trainControl(method="repeatedcv", number=5, repeats=3) |
Let's build a model using the linear regression
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
model_lm = train(mpg ~.,final_data,method="lm", trControl=train_control) |
To view the model summary,
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
model_lm$finalModel |
Predict the "mpg" using Linear Regression
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
predicted_lm=predict(model_lm,final_data) |
Calculate the accuracy of model. For Regression, there are two main metrics to calculate the accuracy of a model. First is RMSE(Root Mean Square Error) and the other is R2(R Squared). R2 has value between 0 and 1 and RMSE has a value greater than 0. Higher the value of R2, better the model and lower the value of RMSE, better the model. R2 is a better metric as it gives a better view of the accuracy of a model.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
rmse_lm= RMSE(predicted_lm, final_data$mpg) | |
r2_lm=R2(predicted_lm, final_data$mpg) |
Let us make another model using Random Forest Algorithm. Only one step changes. To cater the randomness of algorithm, use set.seed() to get the same metric measurements everytime you run the model
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#any number can be used | |
set.seed(7) |
train the model using Random Forest Algorithm
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
model_rf = train(mpg ~.,final_data,method="rf", trControl=train_control) |
Predict the "mpg" using random forest
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
predicted_rf=predict(model_rf,final_data) |
Measure accuracy of the model RMSE and R2
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
rmse_rf= RMSE(predicted_rf, final_data$mpg) | |
r2_rf=R2(predicted_rf, final_data$mpg) |
To see the metrics of both the models in a single graph, use the following
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
allModels=resamples(list(Linear=model_lm, RandomForest=model_rf)) | |
bwplot(allModels,scales=list(relation="free")) |
This plot will show the metrics of both the models in the form of a bwplot and also arranges the models in the ascending order of accuracy. The model at the lowest is the most accurate. When we apply repeated cv, bwplot plots multiple RMSEs and finds the model which is overall the best among multiple iterations.