predict function in R programming

predict is a generic function for predictions from the results of various model fitting functions. The function invokes particular methods which depend on the class of the first argument.

What is the predict function in R programming?

predict is an S3 generic function – S3 is a style of object-oriented programming in R.

If an R package follows this style, some functions that ship with R can be extended, e.g. print, summary, plot, predict. These are called S3 generic functions.

Let’s say you’ve got your own class called ‘obj’. You can create a prediction function for this class and, if it is named predict.obj, it will extend the generic function. R decides which function to dispatch to from the .obj suffix, so if the object’s class is ‘obj’, your prediction function is executed. (Because of the way R searches for a function to dispatch to, it is normally not recommended to put a dot (.) in a function name unless the function is intended to extend a generic function.)

For prediction, you can use predict.lm(), predict.glm(), predict.rpart() … or simply use predict() and let R run the correct function.
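As a minimal sketch of the dispatch mechanism described above, the following defines a toy class 'obj' and a matching predict.obj method. The constructor make_obj and the slope field are hypothetical names invented for this illustration; they are not part of any package.

```r
# Hypothetical constructor: an 'obj' simply stores a slope (illustration only).
make_obj <- function(slope) {
  structure(list(slope = slope), class = "obj")
}

# Because this function is named predict.obj, the generic predict()
# dispatches to it whenever the first argument has class "obj".
predict.obj <- function(object, newdata, ...) {
  object$slope * newdata$x
}

m <- make_obj(slope = 2)

# We call the generic; R picks predict.obj based on class(m).
pred <- predict(m, newdata = data.frame(x = c(1, 2, 3)))
pred
```

Calling methods(predict) in your own session lists the predict methods that are currently available there.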

How does the predict() function in R work?

Suppose you have a data set that contains a number of variables. For example, for each pupil at your school, you may have recorded the age, gender, shoe size and math grade. Hence, a row in your data frame corresponds to a pupil and the four columns correspond to the four variables.

You now fit some predictive model to find out how math grade can be predicted from the other variables. For example

myLinearModel <- lm(mathgrade ~ shoesize + gender + age, data = PupilData)

or

library(rpart)  # rpart() comes from the rpart package
myTreeModel <- rpart(mathgrade ~ shoesize + gender + age, data = PupilData)

You can now find the predicted math grades according to each of the models:

PupilData$linearPrediction <- predict(myLinearModel)

PupilData$treePrediction <- predict(myTreeModel)

You can also use predict to see how the models perform on an independent validation data set:

ValidationPupilData$linearPrediction <- predict(myLinearModel, newdata = ValidationPupilData)

ValidationPupilData$treePrediction <- predict(myTreeModel, newdata = ValidationPupilData)

To see how well the two models predict math grades in the validation data set, you can make a scatter plot, and compare the correlation coefficients to see which is “better”:

attach(ValidationPupilData)

plot(mathgrade, linearPrediction, ylim = range(c(linearPrediction, treePrediction)))

points(mathgrade, treePrediction, col = "red")

cor(mathgrade, linearPrediction)

cor(mathgrade, treePrediction)
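The snippets above assume that the pupil data frames already exist. To try the workflow end to end, here is a self-contained sketch using simulated data. All values and coefficients are invented for illustration, simulate_pupils is a hypothetical helper, and the tree model is only fitted if the rpart package happens to be installed.

```r
set.seed(42)

# Simulated, purely illustrative pupil data (not real measurements).
simulate_pupils <- function(n) {
  age <- sample(10:16, n, replace = TRUE)
  gender <- factor(sample(c("f", "m"), n, replace = TRUE))
  shoesize <- round(28 + 0.8 * age + rnorm(n), 1)
  mathgrade <- 40 + 2 * age + rnorm(n, sd = 5)
  data.frame(age, gender, shoesize, mathgrade)
}

PupilData <- simulate_pupils(200)
ValidationPupilData <- simulate_pupils(100)

myLinearModel <- lm(mathgrade ~ shoesize + gender + age, data = PupilData)

# Predictions for pupils the model has not seen.
linearPrediction <- predict(myLinearModel, newdata = ValidationPupilData)
linearCor <- cor(ValidationPupilData$mathgrade, linearPrediction)

# Fit the tree only when rpart is available in this installation.
if (requireNamespace("rpart", quietly = TRUE)) {
  myTreeModel <- rpart::rpart(mathgrade ~ shoesize + gender + age,
                              data = PupilData)
  treePrediction <- predict(myTreeModel, newdata = ValidationPupilData)
  treeCor <- cor(ValidationPupilData$mathgrade, treePrediction)
}
```

Passing the columns explicitly (ValidationPupilData$mathgrade) avoids attach(), which can be convenient interactively but makes it easy to pick up a stale variable of the same name.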

Beware that for many models, predict() by default returns values on a different scale than the data. For example, for a Poisson regression fitted with glm, predict() returns the logarithm of the expected response unless you ask for type = "response". So you might compare exp(poissonPrediction) to mathgrade. Alternatively, use fitted.values instead of predict, but note that fitted.values doesn’t work with an independent validation set.
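A short sketch of the scale issue, using the built-in warpbreaks data (chosen here only because it contains a count response; it is not part of the pupil example). For a glm, predict() defaults to the link scale, and type = "response" returns values on the scale of the data.

```r
# breaks is a count, so a Poisson GLM with a log link is a reasonable toy model.
poissonModel <- glm(breaks ~ wool + tension, family = poisson, data = warpbreaks)

linkScale <- predict(poissonModel)                          # log of expected count
responseScale <- predict(poissonModel, type = "response")   # expected count

# Back-transforming the link-scale prediction recovers the response scale.
all.equal(exp(linkScale), responseScale)

# For the data used in fitting, the fitted values match the response-scale predictions.
all.equal(unname(fitted(poissonModel)), unname(responseScale))
```

Unlike fitted values, predict(poissonModel, newdata = ..., type = "response") also works for an independent validation set.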

Some methods can also return standard errors of the predictions; predict.lm, for example, does so when se.fit = TRUE.

Using the ‘predict’ function

Example

Once a model is built, predict is the main function for testing it with new data. Our example will use the mtcars built-in dataset to regress miles per gallon against displacement:

my_mdl <- lm(mpg ~ disp, data=mtcars)
my_mdl

Call:
lm(formula = mpg ~ disp, data = mtcars)

Coefficients:
(Intercept)         disp  
   29.59985     -0.04122

If I had a new data source with displacement I could see the estimated miles per gallon.

set.seed(1234)
newdata <- sample(mtcars$disp, 5)
newdata
[1] 258.0  71.1  75.7 145.0 400.0

newdf <- data.frame(disp=newdata)
predict(my_mdl, newdf)
       1        2        3        4        5 
18.96635 26.66946 26.47987 23.62366 13.11381

The most important part of the process is to create a new data frame with the same column names as the original data. In this case, the original data had a column labeled disp, so I made sure to give the column in the new data frame that same name.

Caution

Let’s look at a few common pitfalls:

  1. Not passing a data.frame as the new data:
     predict(my_mdl, newdata)
     Error in eval(predvars, data, env) : numeric 'envir' arg not of length one
  2. Not using the same column names in the new data frame:
     newdf2 <- data.frame(newdata)
     predict(my_mdl, newdf2)
     Error in eval(expr, envir, enclos) : object 'disp' not found
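Both pitfalls can be reproduced safely with tryCatch, as in the sketch below. The helper try_predict and the variable new_disp are hypothetical names used only for this illustration, and the exact wording of the error messages may differ between R versions.

```r
my_mdl <- lm(mpg ~ disp, data = mtcars)
new_disp <- c(258, 71.1, 75.7, 145, 400)

# Hypothetical helper: return the error message instead of stopping.
try_predict <- function(...) {
  tryCatch(predict(my_mdl, ...), error = function(e) conditionMessage(e))
}

# Pitfall 1: a bare numeric vector instead of a data frame.
bad_vector <- try_predict(new_disp)

# Pitfall 2: a data frame whose column is named 'new_disp', not 'disp'.
bad_names <- try_predict(data.frame(new_disp))

# Correct: a data frame with a column named exactly like the predictor.
good <- try_predict(data.frame(disp = new_disp))

bad_vector
bad_names
good
```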

Accuracy

To check the accuracy of the prediction you will need the actual y values of the new data. In this example, newdf will need a column for ‘mpg’ and ‘disp’.

newdf <- data.frame(mpg=mtcars$mpg[1:10], disp=mtcars$disp[1:10])
#     mpg  disp
# 1  21.0 160.0
# 2  21.0 160.0
# 3  22.8 108.0
# 4  21.4 258.0
# 5  18.7 360.0
# 6  18.1 225.0
# 7  14.3 360.0
# 8  24.4 146.7
# 9  22.8 140.8
# 10 19.2 167.6

p <- predict(my_mdl, newdf)

#root mean square error
sqrt(mean((p - newdf$mpg)^2, na.rm=TRUE))
[1] 2.325148
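Note that the ten rows used above were also part of the data the model was fitted on, so that RMSE is an in-sample figure. A sketch of a genuine hold-out check follows; the split size of 10 rows and the seed are arbitrary choices made for illustration.

```r
set.seed(1)

# Hold out 10 rows; the model is fitted on the remaining rows only.
test_rows <- sample(nrow(mtcars), 10)
train <- mtcars[-test_rows, ]
test <- mtcars[test_rows, ]

holdout_mdl <- lm(mpg ~ disp, data = train)
p_holdout <- predict(holdout_mdl, newdata = test)

# Root mean square error on rows the model never saw.
rmse_holdout <- sqrt(mean((p_holdout - test$mpg)^2))
rmse_holdout
```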

Happy predicting!

Below is a detailed explanation of the predict function in R programming.

Predict Method for a Linear Model

Description

Make predictions based on an lm object.

Usage

## S3 method for class 'lm':
predict(object, newdata = NULL, se.fit = FALSE, scale = NULL, df = Inf,
    interval = "none", level = 0.95,
    type = "response", terms = NULL, na.action = na.pass,
    pred.var = res.var/weights, weights = 1, ...) 		
## S3 method for class 'mlm':
predict(object, newdata = NULL, se.fit = FALSE, na.action = na.pass,
    ...)

Arguments

object: a fitted lm (or mlm) object.
newdata: an environment, data frame, or list containing the values of the predictor variables at which predictions are required. This argument can be missing, in which case predictions are made at the same values used to compute the object. The predictors referred to in the right side of the formula in object must be present by name in newdata.
se.fit: if TRUE, pointwise standard errors are computed along with the predictions. This is not available for multiresponse (mlm) models, and se.fit=TRUE will cause an error for them.
scale: if not NULL, this is used instead of the scale recorded in the lm object.
df: if scale is not NULL, this is used as the number of residual degrees of freedom.
interval: the interval type. It can be "none" (the default), "confidence" or "prediction".
level: confidence or prediction level.
type: type of predictions, with choices "response" (the default) or "terms". If "response" is selected, the predictions are on the scale of the response. If type="terms" is selected, the predictions are broken down into the contributions from each term. A matrix of predictions is produced, one column for each term in the model.
terms: if type="terms", the terms= argument can be used to specify which terms should be included; the default is NULL, which means all terms are included. This argument is ignored when type is "response".
na.action: the function to handle missing values (NAs) in newdata. The default is na.pass.
pred.var: prediction variance used instead of the estimated residual variance. Only used when interval is "prediction".
weights: a numeric vector. Another way to specify pred.var: the estimated residual variance will be divided by weights to produce pred.var.
...: further arguments passed to or from other methods.

Details

This function is a method for the generic function predict for class lm. It can be invoked by calling predict for an object x of the appropriate class, or directly by calling predict.lm regardless of the class of the object. (In the latter case an error will occur or the results will be incorrect if the object is not sufficiently like an “lm” object.) Only the listed arguments are used for predict.mlm — any others are silently ignored.

Value

  • If type = "response", the default, a vector of predictions or, if interval is "confidence" or "prediction", a matrix of predictions with columns named "fit", "lwr" and "upr" giving the fit and its lower and upper confidence or prediction limits.
  • If type="terms", a matrix of term-wise fitted values is produced, with one column for each term in the model (or a subset of these if the terms= argument is used). There will be no column for the intercept, but its value will be attached as the attribute named "constant". The row sums of this matrix, plus the constant term, will be the same as the predicted values given in the type="response" case.
  • If se.fit=TRUE, the above fitted values will be the "fit" component of a list. The other components of the returned list will be "se.fit", a vector or matrix of the same shape as "fit" containing the standard errors of the predicted values, "df", the number of residual degrees of freedom, and "residual.scale", the scale of the residuals.

When type is "terms" and interval is "confidence" or "prediction", things are further rearranged. You will get a list with matrix components (with one column per term) "fit", "se.fit", "lwr", and "upr" and scalar components "df" and "residual.scale".
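The return shapes described above can be checked with a small sketch on mtcars. The model mpg ~ disp + wt and the values in new_cars are arbitrary choices for illustration.

```r
fit <- lm(mpg ~ disp + wt, data = mtcars)
new_cars <- data.frame(disp = c(150, 300), wt = c(2.5, 3.5))

# Matrices with columns fit, lwr, upr.
pi_mat <- predict(fit, newdata = new_cars, interval = "prediction", level = 0.95)
ci_mat <- predict(fit, newdata = new_cars, interval = "confidence", level = 0.95)

# List with components fit, se.fit, df, residual.scale.
with_se <- predict(fit, newdata = new_cars, se.fit = TRUE)
names(with_se)

# Term-wise contributions; the intercept part is kept in the "constant" attribute.
term_mat <- predict(fit, newdata = new_cars, type = "terms")
rebuilt <- rowSums(term_mat) + attr(term_mat, "constant")

# Row sums plus the constant reproduce the ordinary predictions.
all.equal(unname(rebuilt), unname(predict(fit, newdata = new_cars)))
```

Prediction intervals account for the residual variance as well as the uncertainty in the fitted line, so pi_mat is wider than ci_mat at every point.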

Warning

predict can produce incorrect predictions when the newdata argument is used if the formula in the object involves data-dependent transformations, such as sqrt(Age - min(Age)). However, certain data-dependent transformations known to lm, such as poly, scale, and splines::ns, are safe to use because lm uses makepredictcall to store the data-dependent information for the transformation in the attr(terms, "predvars") part of its output.
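The following sketch illustrates the warning with mtcars. The hand-written centring I(disp - mean(disp)) stands in for the kind of data-dependent transformation described above; it is my own example, not one from the original documentation.

```r
# Unsafe: mean(disp) is recomputed from whatever data is supplied at predict time.
fit_unsafe <- lm(mpg ~ I(disp - mean(disp)), data = mtcars)

# Safe: poly() has its data-dependent constants stored via makepredictcall.
fit_safe <- lm(mpg ~ poly(disp, 2), data = mtcars)

one_car <- data.frame(disp = 100)

# With a single new row, disp - mean(disp) is 0, so the slope is silently
# ignored and the prediction collapses to the intercept.
unsafe_pred <- predict(fit_unsafe, newdata = one_car)
all.equal(unname(unsafe_pred), unname(coef(fit_unsafe)[1]))

# The safe model reproduces its own fitted values when given the original rows.
safe_pred <- predict(fit_safe, newdata = mtcars[1:3, ])
all.equal(safe_pred, fitted(fit_safe)[1:3])
```

No error or warning is raised in the unsafe case, which is what makes this pitfall easy to miss.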

Examples

fit <- lm(Fuel ~ Weight + Disp. + Type, data=Sdatasets::fuel.frame)
predict(fit)
predict(fit, newdata=data.frame(Weight=2750, Disp.=110, Type="Small"))
predict(fit, interval = "prediction",
    newdata=data.frame(Weight=2700, Disp.=300, Type="Sporty"))
predict(fit, type = "terms")

Hope you learned something from this post. The primary source of this article is StackOverflow.


About ᴾᴿᴼᵍʳᵃᵐᵐᵉʳ

Linux and Python enthusiast, in love with open source since 2014, Writer at programming-articles.com, India.
