Monday, August 28, 2017

Predictive Models in R

Environment: RStudio

Make sure you have R and RStudio installed on your machine. If not, use the links below to install them.

Install R
Install R on your machine

Install RStudio
Install RStudio on your machine. 

We'll build a predictive model using the "mtcars" dataset, which is one of the datasets built into R. It has 11 variables, namely "mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear" and "carb", for 32 car models.
Load this dataset into the variable myData.
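A minimal sketch of the loading step. The variable name myData comes from the text. The dim() and head() calls are only sanity checks added for illustration.

```r
# mtcars ships with base R (in the "datasets" package), so nothing is downloaded.
data(mtcars)
myData <- mtcars

# Sanity checks: the number of rows and columns, then the first few rows.
dim(myData)
head(myData)
```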
Let's build a model to predict "mpg" from the "qsec", "hp" and "wt" attributes. We'll use the "caret" package to build the models. Install and load the caret package.
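One way to install and load caret is sketched here. The requireNamespace() guard is an addition for illustration, so the package is only downloaded when it is missing. It assumes a CRAN mirror is already configured.

```r
# Install caret only if it is not already available, then attach it.
if (!requireNamespace("caret", quietly = TRUE)) {
  install.packages("caret")
}
library(caret)
```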
Select the columns we need.
Replace all blank cells with NA, then remove every row that contains an NA.
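The two data-preparation steps above can be sketched as follows. The name modelData is hypothetical. Since mtcars is fully numeric and has no blank cells, the cleaning lines leave it unchanged; they are included to show the pattern for messier data.

```r
myData <- mtcars

# Keep only the response (mpg) and the three predictors.
# modelData is a hypothetical name used for illustration.
modelData <- myData[, c("mpg", "qsec", "hp", "wt")]

# Mark blank cells (empty strings) as NA, ignoring cells that are already NA.
blank <- !is.na(modelData) & modelData == ""
modelData[blank] <- NA

# Drop every row that contains at least one NA.
modelData <- na.omit(modelData)
```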
Define the train control for training the model using repeated k-fold cross-validation.
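A sketch of the train control using caret's trainControl(). The text does not state the number of folds or repeats, so 10 folds repeated 3 times is an assumption, as is the name trainCtrl.

```r
library(caret)

# Repeated k-fold cross-validation.
# number = 10 and repeats = 3 are assumed values, not taken from the text.
trainCtrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
```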
Let's build a model using linear regression.
View the model summary.
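Training and inspecting the linear model might look like the sketch below. The names modelData, trainCtrl and lmModel, the seed value, and the 10-fold, 3-repeat setting are assumptions for illustration.

```r
library(caret)

# Data and resampling setup (hypothetical names, assumed fold/repeat counts).
modelData <- na.omit(mtcars[, c("mpg", "qsec", "hp", "wt")])
trainCtrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# Fix the seed so the cross-validation folds are reproducible.
set.seed(123)
lmModel <- train(mpg ~ qsec + hp + wt,
                 data = modelData,
                 method = "lm",
                 trControl = trainCtrl)

# Cross-validated RMSE and R squared.
print(lmModel)

# Coefficients and fit statistics of the final linear model.
summary(lmModel)
```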
Predict "mpg" using the linear regression model.
Calculate the accuracy of the model. For regression, the two main metrics are RMSE (Root Mean Square Error) and R2 (R squared). R2 lies between 0 and 1, and RMSE is never negative. The higher the R2, the better the model; the lower the RMSE, the better the model. RMSE is expressed in the units of the target (miles per gallon here), whereas R2 is unitless, which often makes it the easier metric to interpret and compare.
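A sketch of the prediction and scoring steps, assuming the same hypothetical names as before. Predictions are made on the data the model was trained on, so these numbers are in-sample and tend to look better than the cross-validated ones. caret's postResample() reports R squared as the squared correlation between predictions and observations.

```r
library(caret)

# Setup (hypothetical names; fold/repeat counts and seed are assumptions).
modelData <- na.omit(mtcars[, c("mpg", "qsec", "hp", "wt")])
set.seed(123)
lmModel <- train(mpg ~ qsec + hp + wt,
                 data = modelData,
                 method = "lm",
                 trControl = trainControl(method = "repeatedcv",
                                          number = 10, repeats = 3))

# Predict mpg for every row of the modelling data.
lmPred <- predict(lmModel, newdata = modelData)

# RMSE computed by hand: square root of the mean squared error.
rmseManual <- sqrt(mean((modelData$mpg - lmPred)^2))

# caret helper returning RMSE, Rsquared and MAE in one vector.
lmMetrics <- postResample(pred = lmPred, obs = modelData$mpg)
print(lmMetrics)
```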
Let us make another model using the Random Forest algorithm. Only one step changes. To account for the randomness of the algorithm, call set.seed() so that you get the same metric values every time you run the model.
Train the model using the Random Forest algorithm.
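The Random Forest version can be sketched as below. caret's method = "rf" relies on the randomForest package, which is assumed to be installed. The seed value and the name rfModel are arbitrary choices for illustration.

```r
library(caret)  # method = "rf" also needs the randomForest package installed

# Setup (hypothetical names; fold/repeat counts are assumptions).
modelData <- na.omit(mtcars[, c("mpg", "qsec", "hp", "wt")])
trainCtrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# Fix the seed so both the folds and the forest are reproducible.
set.seed(123)
rfModel <- train(mpg ~ qsec + hp + wt,
                 data = modelData,
                 method = "rf",
                 trControl = trainCtrl)

# Cross-validated metrics for each candidate value of mtry.
print(rfModel)
```

Only the method argument differs from the linear regression call; the formula, data and train control stay the same.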
Predict "mpg" using the random forest model.
Measure the accuracy of the model with RMSE and R2.
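Scoring the random forest follows the same pattern as the linear model. This is a sketch under the same assumptions (hypothetical names, in-sample predictions, randomForest installed).

```r
library(caret)  # method = "rf" also needs the randomForest package installed

# Setup (hypothetical names; fold/repeat counts and seed are assumptions).
modelData <- na.omit(mtcars[, c("mpg", "qsec", "hp", "wt")])
set.seed(123)
rfModel <- train(mpg ~ qsec + hp + wt,
                 data = modelData,
                 method = "rf",
                 trControl = trainControl(method = "repeatedcv",
                                          number = 10, repeats = 3))

# Predict mpg for every row, then compute RMSE, Rsquared and MAE.
rfPred <- predict(rfModel, newdata = modelData)
rfMetrics <- postResample(pred = rfPred, obs = modelData$mpg)
print(rfMetrics)
```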
To see the metrics of both models in a single plot, collect their resampling results and draw a box-and-whisker plot (bwplot).
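A sketch of the comparison using caret's resamples() and the bwplot() method for its result. The labels LM and RF and the other variable names are hypothetical. The same seed is set before each train() call so both models are evaluated on the same folds.

```r
library(caret)  # method = "rf" also needs the randomForest package installed

# Setup (hypothetical names; fold/repeat counts and seed are assumptions).
modelData <- na.omit(mtcars[, c("mpg", "qsec", "hp", "wt")])
trainCtrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# Same seed before each fit so both models see identical folds.
set.seed(123)
lmModel <- train(mpg ~ qsec + hp + wt, data = modelData,
                 method = "lm", trControl = trainCtrl)
set.seed(123)
rfModel <- train(mpg ~ qsec + hp + wt, data = modelData,
                 method = "rf", trControl = trainCtrl)

# Collect the per-resample metrics of both models.
results <- resamples(list(LM = lmModel, RF = rfModel))
summary(results)

# Box-and-whisker plot of the resampled metrics, one panel per metric.
print(bwplot(results))
```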
This plot shows the metrics of both models as a bwplot and arranges the models in ascending order of accuracy. The model at the bottom is the most accurate. With repeated cross-validation, each box summarises the RMSE values from all resamples, so the comparison shows which model is best overall across multiple iterations rather than on a single split.