## 6.6 Estimating an OR using logistic regression

**Unadjusted, categorical predictor**

We can also estimate an OR using logistic regression via the `glm()`

function (“generalized linear model”).

**Example 6.2 (continued):** Use logistic regression to estimate the OR comparing the odds of lifetime marijuana use (`mj_lifetime`

) between males and females (`demog_sex`

).

First, re-level `demog_sex`

to make “Female” the reference level in order to estimate the OR comparing males to females.

```
<- nsduh %>%
nsduh mutate(demog_sex = relevel(demog_sex, ref = "Female"))
```

Next, use `glm()`

with `family = binomial`

to fit a logistic regression and `summary()`

to view the output.

```
.2 <- glm(mj_lifetime ~ demog_sex,
fit.ex6family = binomial, data = nsduh)
summary(fit.ex6.2)
```

```
##
## Call:
## glm(formula = mj_lifetime ~ demog_sex, family = binomial, data = nsduh)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.28 -1.12 1.08 1.08 1.24
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.1350 0.0867 -1.56 0.1195
## demog_sexMale 0.3678 0.1274 2.89 0.0039 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1386.0 on 999 degrees of freedom
## Residual deviance: 1377.6 on 998 degrees of freedom
## AIC: 1382
##
## Number of Fisher Scoring iterations: 3
```

We showed previously that the intercept (-0.13504) is the log-odds of lifetime marijuana use when all the predictors are zero or at their reference level. Use `ilogit()`

to convert a log-odds to a probability.

`ilogit(coef(fit.ex6.2)["(Intercept)"])`

```
## (Intercept)
## 0.4663
```

Thus, we estimate that the probability of marijuana use among females (the reference level) is \(p = e^{-0.13504} / (1 + e^{-0.13504}) = 0.47\) (the same value we computed using the 2 \(\times\) 2 table).

The regression coefficient for `demog_sexMale`

(0.36784) represents the log of the OR for lifetime marijuana use comparing males to females. To get the OR itself, exponentiate the coefficient using `exp()`

. Thus, we estimate the OR to be \(e^{0.36784}\) = 1.44 (again, the same value we computed using the 2 \(\times\) 2 table).

```
# [-1] drops the first row which corresponds to the intercept)
exp(coef(fit.ex6.2)[-1])
```

```
## demog_sexMale
## 1.445
```

Use `confint()`

to compute 95% CIs for the regression coefficients, and exponentiate these to get 95% CIs for the ORs (dropping the intercept since \(e^{\beta_0}\) is an odds not an OR).

**NOTE:** `confint()`

uses a profile likelihood method rather than the Wald method used by `summary()`

. Thus, the 95% CIs may not correspond exactly to the p-values. That is, you may have p < .05 but a 95% CI that contains 1, or p > .05 but a 95% CI that does not contain 1.

```
# CIs for regression coefficients
confint(fit.ex6.2)
```

```
## 2.5 % 97.5 %
## (Intercept) -0.3055 0.03476
## demog_sexMale 0.1186 0.61808
```

```
# CI for OR (use [-1,] to drop the first row which corresponds to the intercept)
# and drop=F to retain the rownames since there is only one predictor term
exp(confint(fit.ex6.2))[-1, , drop=F]
```

```
## 2.5 % 97.5 %
## demog_sexMale 1.126 1.855
```

In this example, the categorical predictor had just 2 levels, so get the p-value testing its association with the outcome from the regression coefficient table (0.00388). In general, use `car::Anova()`

to obtain the p-value for any predictor, continuous or categorical with any number of levels. Unlike with linear regression, there are two different options here, a likelihood ratio test or a Wald test. We will use the Wald test here to be consistent with the p-values in the regression coefficient table (which are equivalent to Wald tests). However, see Section 6.18 for how to obtain likelihood ratio tests throughout.

`::Anova(fit.ex6.2, type = 3, test.statistic = "Wald") car`

```
## Analysis of Deviance Table (Type III tests)
##
## Response: mj_lifetime
## Df Chisq Pr(>Chisq)
## (Intercept) 1 2.42 0.1195
## demog_sex 1 8.34 0.0039 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

**Conclusion:** Males have significantly greater odds of lifetime marijuana use than females (OR = 1.44; 95% CI = 1.13, 1.86; p = .004). Males have 44% greater odds of lifetime marijuana use than females.

For this simple case with a binary outcome and a binary predictor, the OR is exactly the same as that obtained via the cross-product from a 2 \(\times\) 2 table. An advantage of using logistic regression, however, is that it provides the ability to adjust for other predictors to obtain an adjusted OR (AOR), as we will see in Section 6.6.3.

**Unadjusted, continuous predictor**

Next, let’s fit a logistic regression with a continuous predictor and interpret the results.

**Example 6.3:** Using the 2019 National Survey of Drug Use and Health (NSDUH) teaching dataset (Section A.5), what is the association between lifetime marijuana use (`mj_lifetime`

) and age at first use of alcohol (`alc_agefirst`

)?

Fit the model and interpret the coefficients.

```
.3 <- glm(mj_lifetime ~ alc_agefirst, family = binomial, data = nsduh)
fit.ex6round(summary(fit.ex6.3)$coef, 4)
```

```
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.3407 0.4747 11.25 0
## alc_agefirst -0.2835 0.0267 -10.62 0
```

`ilogit(coef(fit.ex6.3)["(Intercept)"])`

```
## (Intercept)
## 0.9952
```

The inverse logit of the intercept (5.3407) is the probability of the outcome when all predictors are zero or at their reference level. Thus, 0.995 is the estimated probability of lifetime marijuana use among those who started drinking alcohol at age 0 years.

Examining a summary of the predictor, we see that the range of age of first use of alcohol is 3 to 45 years, and the median value is 17 years.

```
# Examine the predictor
summary(nsduh$alc_agefirst)
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 3.0 15.0 17.0 17.5 19.0 45.0 157
```

Thus, the probability of marijuana use for those who first drank alcohol at age 0 years is an extrapolation. Had we centered the predictor, the intercept would be more interpretable. For example, had we centered it at the median, 17 years, the intercept would be interpreted as the log-odds at the median age of first alcohol use, with a corresponding estimated prevalence of lifetime marijuana use among 17-year-olds of 0.63.

```
<- nsduh %>%
nsduh mutate(calc_agefirst = alc_agefirst - 17)
3.centered <- glm(mj_lifetime ~ calc_agefirst,
fit.ex6.family = binomial, data = nsduh)
ilogit(coef(fit.ex6.3.centered)["(Intercept)"])
```

```
## (Intercept)
## 0.6273
```

The regression coefficient for `alc_agefirst`

(-0.2835) is the difference in log-odds associated with a 1-year difference in age at first use of alcohol. Converting this to an OR by exponentiating, we get \(e^{-0.2835} = 0.753\).

`exp(coef(fit.ex6.3)[-1])`

```
## alc_agefirst
## 0.7531
```

Thus, starting drinking alcohol 1 year later is associated with 24.7% lower odds of lifetime marijuana use (since 1 - 0.753 = 0.247). Finally, compute a 95% CI for the OR using `confint()`

and exponentiating.

`exp(confint(fit.ex6.3)[-1,])`

```
## 2.5 % 97.5 %
## 0.7135 0.7923
```

**Conclusion:** Age at first alcohol use is significantly negatively associated with lifetime marijuana use (OR = 0.753; 95% CI = 0.713, 0.792; p < .001). Individuals who first used alcohol at, say, age 19 years have 24.7% lower odds of having ever used marijuana than those who first used alcohol at age 18 years.

### 6.6.1 OR associated with other than a 1-unit difference

If, after fitting the model, you want to know the increase or reduction in odds associated with other than a 1-unit difference in a continuous predictor, multiply the regression coefficient by that factor *before* exponentiating. Similarly, to compute the corresponding confidence interval, multiply the confidence limits for the regression coefficient by that factor prior to exponentiating.

If you want to have the logistic regression model itself estimate an OR corresponding to other than a 1-unit difference then, prior to fitting the model, *divide* the continuous predictor by the desired factor. In this way a 1-unit difference on the new scale will correspond to the desired difference on the original scale.

**Example 6.3 (continued):** To estimate the reduction in odds associated with starting drinking alcohol 3 years later, do *not* multiply 24.7% by 3 (which would be 74.1%). Instead, first compute the OR as \(e^{-0.2835 \times 3} = 0.427\) and then compute the percent change in odds. In other words, those who started using alcohol 3 years later have 57.3% lower odds of lifetime marijuana use (not 74.1%).

```
# OR for a 3-unit difference
exp(3*coef(fit.ex6.3))[-1]
```

```
## alc_agefirst
## 0.4272
```

```
# 95% CI for the OR for a 3-unit difference
exp(3*confint(fit.ex6.3))[-1,]
```

```
## 2.5 % 97.5 %
## 0.3632 0.4973
```

```
# Same result accomplished by transforming the
# predictor prior to fitting the model
<- nsduh %>%
tmpdat mutate(alc_agefirst3 = alc_agefirst/3)
<- glm(mj_lifetime ~ alc_agefirst3,
tmpfit family = binomial, data = tmpdat)
exp(coef(tmpfit))[-1]
```

```
## alc_agefirst3
## 0.4272
```

`exp(confint(tmpfit))[-1,]`

```
## 2.5 % 97.5 %
## 0.3632 0.4973
```

### 6.6.2 Make sure you know what probability `glm()`

is modeling

The outcome \(Y\) must be binary (exactly 2 unique values, not including `NA`

) and have numeric values 0 and 1 or be coded as a factor. If \(Y\) is numeric, then `glm()`

will model \(p = P(Y = 1)\). If \(Y\) is a factor, then `glm()`

will model \(p = P(Y = \textrm{the non-reference level})\). If the outcome is not coded to these specifications, or if you want to model the probability of the other level of the outcome, then recode it.

It is essential to know which level of the outcome corresponds to the probability \(p\) when using logistic regression. Use `is.factor()`

to check if \(Y\) is a factor, and `levels()`

to examine the levels (if it is a factor). The first level listed by `levels()`

is the reference level, so the second is the level whose probability is being modeled. If the outcome is not a factor, use `unique()`

to check the unique values – they should be 0 and 1.

**Example 6.3 (continued):** For the outcome \(Y\) = `mj_lifetime`

, which has possible values No (reference) and Yes, `glm()`

models \(p = P(Y = \textrm{Yes})\).

`is.factor(nsduh$mj_lifetime)`

`## [1] TRUE`

```
# If it is a factor, check the levels
levels(nsduh$mj_lifetime)
```

`## [1] "No" "Yes"`

```
# If it is not a factor, check
# the unique values.
# Should be 0 and 1
# unique(x)
```

### 6.6.3 Adjusted OR

Just as you can move from simple to multiple linear regression by adding more predictors, so you can move from simple to multiple logistic regression. When there are multiple predictors in a logistic regression model, the resulting odds ratios are called adjusted odds ratios (AOR).

**Example 6.3 (continued):** What is the association between lifetime marijuana use (`mj_lifetime`

) and age at first use of alcohol (`alc_agefirst`

), adjusted for age (`demog_age_cat6`

), sex (`demog_sex`

), and income (`demog_income`

)? Fit the model and compute the AOR, its 95% CI, and the p-value that tests the significance of age at first use of alcohol. While we are not primarily interested in the confounders, it is common to also report their AORs, 95% CIs, and p-values. Since there are some categorical confounders with more than two levels, use `car::Anova()`

to compute multiple df p-values.

```
3.adj <- glm(mj_lifetime ~ alc_agefirst + demog_age_cat6 + demog_sex +
fit.ex6.family = binomial, data = nsduh)
demog_income, # Regression coefficient table
round(summary(fit.ex6.3.adj)$coef, 4)
```

```
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.2542 0.5914 10.5759 0.0000
## alc_agefirst -0.2754 0.0276 -9.9922 0.0000
## demog_age_cat626-34 -0.2962 0.3286 -0.9012 0.3675
## demog_age_cat635-49 -0.8043 0.2966 -2.7120 0.0067
## demog_age_cat650-64 -0.6899 0.2985 -2.3109 0.0208
## demog_age_cat665+ -1.2748 0.3043 -4.1893 0.0000
## demog_sexMale -0.0609 0.1618 -0.3763 0.7067
## demog_income$20,000 - $49,999 -0.5309 0.2664 -1.9927 0.0463
## demog_income$50,000 - $74,999 -0.0793 0.3049 -0.2601 0.7948
## demog_income$75,000 or more -0.3612 0.2532 -1.4264 0.1538
```

```
# AORs and 95% CIs
# Use cbind to view the AORs and CIs side-by-side
# Add [-1,] to drop the row for the intercept
# since exp(intercept) is not an odds ratio
<- cbind("AOR" = exp(coef(fit.ex6.3.adj)),
OR.CI exp(confint(fit.ex6.3.adj)))[-1,]
round(OR.CI, 3)
```

```
## AOR 2.5 % 97.5 %
## alc_agefirst 0.759 0.718 0.800
## demog_age_cat626-34 0.744 0.387 1.409
## demog_age_cat635-49 0.447 0.247 0.791
## demog_age_cat650-64 0.502 0.275 0.891
## demog_age_cat665+ 0.279 0.152 0.502
## demog_sexMale 0.941 0.684 1.291
## demog_income$20,000 - $49,999 0.588 0.347 0.987
## demog_income$50,000 - $74,999 0.924 0.507 1.680
## demog_income$75,000 or more 0.697 0.421 1.139
```

```
# P-values
::Anova(fit.ex6.3.adj, type = 3, test.statistic = "Wald") car
```

```
## Analysis of Deviance Table (Type III tests)
##
## Response: mj_lifetime
## Df Chisq Pr(>Chisq)
## (Intercept) 1 111.85 < 0.0000000000000002 ***
## alc_agefirst 1 99.84 < 0.0000000000000002 ***
## demog_age_cat6 4 23.01 0.00013 ***
## demog_sex 1 0.14 0.70669
## demog_income 3 5.44 0.14197
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Interpretation of the output:

- The AOR for our primary predictor
`alc_agefirst`

is 0.759. This represents the OR for lifetime marijuana use comparing those with a one-year difference in age at first use of alcohol, adjusted for age, sex, and income. - The remaining AORs compare levels of categorical predictors to their reference level, adjusted for the other predictors in the model. For example, comparing individuals with the same age of first alcohol use, sex, and income, 35-49 year-olds have 55.3% lower odds of lifetime marijuana use than 18-25 year-olds (OR = 0.447; 95% CI = 0.247, 0.791; p = .007). The p-value for this specific comparison of ages comes from the
`Coefficients`

table. An overall, 4 df p-value for age, can be read from the`Type III Test`

table (0.00013). - The
`Type III tests`

output contains the multiple df Wald tests for categorical predictors with more than two levels. For continuous predictors, or for categorical predictors with exactly two levels, the Type III Wald test p-values are identical to those in the`Coefficients`

table.

**Conclusion:** After adjusting for age, sex, and income, age at first alcohol use is significantly negatively associated with lifetime marijuana use (AOR = 0.759; 95% CI = 0.718, 0.800; p < .001). Individuals who first used alcohol at a given age have 24.1% lower odds of having ever used marijuana than those who first used alcohol one year earlier.

**NOTE:** The steps of extracting the coefficients and CIs and exponentiating them are demonstrated here to make it clear where the numbers are coming from. However, the `car::S()`

function (Fox, Weisberg, and Price 2022; Fox and Weisberg 2019) can be used instead to conveniently produce output showing the `summary()`

, as well as the AORs and their 95% CIs (but you still need `car::Anova()`

for the Type III Wald tests).

```
::S(fit.ex6.3.adj)
car# (output not shown)
```

### References

*An R Companion to Applied Regression*. 3rd ed. Los Angeles: Sage Publications, Inc.

*Car: Companion to Applied Regression*. https://CRAN.R-project.org/package=car.