Monday, August 28, 2017

Predictive Models in R

Environment: RStudio

Make sure you have R and RStudio installed on your machine. If not, follow the links below to install them.

Install R
Install R on your machine

Install RStudio
Install RStudio on your machine. 

We'll build a predictive model using the "mtcars" dataset, which is one of the datasets built into R. It has 11 variables, namely "mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear" and "carb", for 32 car models.
Load this dataset into the variable myData:
myData=mtcars
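Before modelling, it can help to confirm that the dataset looks as described. A minimal sketch using base R only (the name myData follows the step above):

```r
# Load the built-in dataset, as in the step above
myData <- mtcars

# Dimensions: rows are car models, columns are variables
dim(myData)

# Column names and the first few rows
names(myData)
head(myData)
```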
Let's make a model to predict "mpg" using the "qsec", "hp" and "wt" attributes. We'll use the "caret" package to build the models. Install and load the caret package:
install.packages("caret")
library(caret)
Separate the chunk of data we need:
final_data = myData[, c("wt", "mpg", "qsec", "hp")]
Replace all blank cells with NA, then remove every row that contains an NA:
final_data[final_data ==""]=NA
final_data = final_data[complete.cases(final_data), ]
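mtcars itself has no blank cells, so the two lines above leave it unchanged. To see what they do, here is a small sketch on a hypothetical toy data frame (toy and its columns are made up for illustration and are not part of mtcars):

```r
# Hypothetical toy data with two blank cells
toy <- data.frame(
  name  = c("a", "", "c"),
  value = c("1", "2", ""),
  stringsAsFactors = FALSE
)

# Turn blank cells into NA
toy[toy == ""] <- NA

# Count missing values per column
colSums(is.na(toy))

# Keep only the rows with no missing cells
toy_clean <- toy[complete.cases(toy), ]
toy_clean
```

Only the first row survives, because each of the other two rows has one missing cell.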
Define the train control for training the model using Repeated K-fold Cross Validation
train_control=trainControl(method="repeatedcv", number=5, repeats=3)
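To get a feel for what number=5 and repeats=3 mean, the sketch below assigns rows to folds by hand in base R. It is only an illustration of the idea; caret builds its folds internally and may do so differently in detail, and the names fold_ids and total_resamples are made up here.

```r
set.seed(7)                 # any number can be used
n       <- nrow(mtcars)     # 32 rows
k       <- 5                # number of folds (number=5)
repeats <- 3                # number of repeats (repeats=3)

# For each repeat, randomly assign every row to one of the k folds
fold_ids <- lapply(seq_len(repeats), function(r) {
  sample(rep(seq_len(k), length.out = n))
})

# Fold sizes in the first repeat
table(fold_ids[[1]])

# Each fold is held out once per repeat
total_resamples <- k * repeats
total_resamples
```

With 32 rows and 5 folds, each fold holds 6 or 7 rows, and the model is fitted and evaluated 15 times in total.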
Let's build a model using linear regression:
model_lm = train(mpg ~.,final_data,method="lm", trControl=train_control)
To view the fitted model and its coefficients (wrap it in summary() for the full summary),
model_lm$finalModel
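With method="lm" the final model is an ordinary linear regression, so a comparable fit can be sketched directly with base R's lm(). This assumes the same three predictors and involves no cross-validation; the name fit is made up for this example.

```r
# Same columns as in the tutorial
final_data <- mtcars[, c("wt", "mpg", "qsec", "hp")]

# Ordinary least squares fit of mpg on all other columns
fit <- lm(mpg ~ ., data = final_data)

# Coefficients, then the full summary table
coef(fit)
summary(fit)
```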
Predict the "mpg" using Linear Regression
predicted_lm=predict(model_lm,final_data)
Calculate the accuracy of the model. For regression, the two main accuracy metrics are RMSE (Root Mean Square Error) and R2 (R squared). R2 lies between 0 and 1, and RMSE is never negative. The higher the R2, the better the model; the lower the RMSE, the better the model. R2 is often the easier metric to read because it is unitless, whereas RMSE is expressed in the units of the target (here, mpg). Note that the values below are computed on the same data the model was trained on, so they are likely to look more optimistic than the cross-validated metrics.
rmse_lm= RMSE(predicted_lm, final_data$mpg)
r2_lm=R2(predicted_lm, final_data$mpg)
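Both metrics are simple enough to compute by hand, which makes it clearer what they measure. The sketch below uses a plain lm() fit instead of the caret model, and it assumes R2 is taken as the squared correlation between predictions and observations (my understanding of caret's default, not something stated above). The names rmse_manual and r2_manual are made up.

```r
final_data <- mtcars[, c("wt", "mpg", "qsec", "hp")]
fit        <- lm(mpg ~ ., data = final_data)
pred       <- predict(fit, final_data)
obs        <- final_data$mpg

# RMSE: square root of the mean squared error, in mpg units
rmse_manual <- sqrt(mean((pred - obs)^2))

# R2 as the squared correlation between predictions and observations
r2_manual <- cor(pred, obs)^2

rmse_manual
r2_manual
```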
Let us make another model using the Random Forest algorithm. Only one step changes. To account for the randomness of the algorithm, call set.seed() first so that you get the same metric measurements every time you run the model:
#any number can be used
set.seed(7)
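A quick way to see what set.seed() does is to draw random numbers twice from the same seed. This is a minimal base R sketch; the names first_draw and second_draw are made up.

```r
# Draw five random numbers after seeding
set.seed(7)
first_draw <- sample(1:100, 5)

# Reset the same seed and draw again
set.seed(7)
second_draw <- sample(1:100, 5)

# The two draws match because the seed was the same
identical(first_draw, second_draw)
```

The same principle applies to the random forest: seeding before train() makes the resampling and tree building repeatable.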
Train the model using the Random Forest algorithm:
model_rf = train(mpg ~.,final_data,method="rf", trControl=train_control)
Predict the "mpg" using random forest
predicted_rf=predict(model_rf,final_data)
Measure the accuracy of the model using RMSE and R2:
rmse_rf= RMSE(predicted_rf, final_data$mpg)
r2_rf=R2(predicted_rf, final_data$mpg)
To see the metrics of both the models in a single graph, use the following
allModels=resamples(list(Linear=model_lm, RandomForest=model_rf))
bwplot(allModels,scales=list(relation="free"))
This plot shows the metrics of both models as a box-and-whisker plot (bwplot) and arranges the models in ascending order of accuracy, so the model at the bottom is the most accurate. With repeated CV, bwplot plots the metrics from every resample (here 5 folds x 3 repeats = 15 values per model), which shows which model is best overall across the iterations.