5.10 Predictions
To compute predicted values for the mean outcome for every observation, use predict() (these were referred to as the fitted values in Section 5.6). Add interval = "confidence" to obtain a 95% confidence interval for each fitted value. For Example 5.1, the code is as follows.
predict(fit.ex5.1, interval = "confidence")
But what if you only want to compute the predicted mean outcome value at a single specific set of predictor values? Even if the set of predictor values of interest occurs in your dataset, it is easier to compute the prediction for this set than to search through the observations to find one with matching predictor values in order to examine its fitted value.
To get a prediction, you could plug the predictor values into the regression equation, multiplying each by its corresponding regression coefficient, summing the products, and adding the intercept. However, it is much easier to use predict() to do the calculation for you, with the added benefit of also getting a 95% confidence interval.
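The equivalence between the manual calculation and predict() can be checked directly. The following is a minimal sketch that assumes a stand-in model fit to R's built-in mtcars data, not the Example 5.1 model; the names fit.demo, new.x, manual, and pred are made up for illustration.

```r
# Stand-in model on the built-in mtcars data (NOT the Example 5.1 model)
fit.demo <- lm(mpg ~ wt + hp, data = mtcars)

# Predictor values at which to predict (illustrative values only)
new.x <- data.frame(wt = 3, hp = 120)

# Manual calculation: intercept + sum of (coefficient * predictor value)
b <- coef(fit.demo)
manual <- unname(b["(Intercept)"] + b["wt"] * new.x$wt + b["hp"] * new.x$hp)

# The same prediction from predict(), plus a 95% confidence interval
pred <- predict(fit.demo, new.x, interval = "confidence")

manual
pred

# TRUE if the manual value matches the "fit" column from predict()
all.equal(manual, unname(pred[1, "fit"]))
```

The manual route yields only the point estimate; the interval bounds in the lwr and upr columns come from predict().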
The optional second argument to predict() is a data.frame containing predictor values. The predictor values must be in the same format as those used to fit the model. For a numeric predictor, specify a number (not in quotes). For a categorical predictor, specify a level (in quotes). To make sure you enter a legitimate factor level, use levels() to check the spelling.
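As a small illustration of this format, the sketch below assumes a stand-in model with one numeric and one categorical predictor, fit to R's built-in iris data rather than the NHANES data; the names fit.iris and new.iris are made up for illustration.

```r
# Stand-in model on the built-in iris data (NOT the Example 5.1 model)
fit.iris <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)

# Check the exact spelling of the factor levels before typing one
levels(iris$Species)

# Numeric predictor: a number, no quotes. Categorical predictor: a level, in quotes.
new.iris <- data.frame(Petal.Length = 4.5, Species = "versicolor")

# Predicted mean outcome with a 95% confidence interval
pred.iris <- predict(fit.iris, new.iris, interval = "confidence")
pred.iris
```

The column names in the data.frame must match the predictor names in the model formula exactly.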
NOTE: If any categorical predictor level is misspelled, predict() will return an error. However, specifying one or more continuous predictor values beyond the range observed in the data will, unfortunately, not return an error or even a warning. Therefore, when making predictions, be careful to only predict at values within the range of the data used to fit the model. For example, if the model were fit using data from those age 18 years and older, a prediction for 10-year-olds would be invalid (see Section 5.25).
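Both behaviors described in the note can be seen in a short sketch. It assumes a stand-in model fit to R's built-in iris data; fit.chk, msg, extrap, and the helper in.range() are hypothetical names introduced here, and in.range() is only one possible way to guard against extrapolation.

```r
# Stand-in model on the built-in iris data (NOT the Example 5.1 model)
fit.chk <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)

# A misspelled level ("Setosa" instead of "setosa") makes predict() stop
# with an error; tryCatch() captures the message so the script keeps running
msg <- tryCatch(
  predict(fit.chk, data.frame(Petal.Length = 1.5, Species = "Setosa")),
  error = function(e) conditionMessage(e))
msg

# A continuous value far outside the observed range is accepted silently
range(iris$Petal.Length)
extrap <- predict(fit.chk, data.frame(Petal.Length = 20, Species = "setosa"))
extrap

# Hypothetical helper: is x within the range of the observed values?
in.range <- function(x, observed) {
  x >= min(observed, na.rm = TRUE) & x <= max(observed, na.rm = TRUE)
}
in.range(20, iris$Petal.Length)
in.range(4, iris$Petal.Length)
```

A check like this only looks at one predictor at a time, so a combination of values that never occurs together in the data can still pass it.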
Example 5.1 (continued): Estimate the predicted mean fasting glucose at the following values of waist circumference, smoking status, age, gender, race/ethnicity, and income:
- WC = 130 cm
- Smoker = Current
- Age = 50 years
- Gender = Male
- Race/ethnicity = Non-Hispanic Black
- Income = $55,000+
First, check the spelling of the levels (results not shown).
levels(nhanesf.complete$smoker)
levels(nhanesf.complete$RIAGENDR)
levels(nhanesf.complete$race_eth)
levels(nhanesf.complete$income)
Next, use predict() with the appropriate data.frame as the second argument.
# Use predict() with a data.frame with predictor levels
predict(fit.ex5.1, data.frame(
    BMXWAIST = 130,
    smoker   = "Current",
    RIDAGEYR = 50,
    RIAGENDR = "Male",
    race_eth = "Non-Hispanic Black",
    income   = "$55,000+"),
  interval = "confidence")
##     fit   lwr   upr
## 1 7.168 6.711 7.625
Conclusion: The predicted mean fasting glucose among individuals with the specified predictor values is 7.17 mmol/L (95% CI = 6.71, 7.63).