8.6 Weighted binary logistic regression

8.6.1 Fitting the model

To carry out a binary logistic regression that incorporates a survey design, use svyglm() with family=quasibinomial(). This produces the same results as family=binomial() but avoids a warning about non-integer numbers of successes. As with glm(), svyglm() models the probability that the outcome is at the non-reference level, if the outcome is a factor, or the probability that the outcome is 1, if the outcome is numeric with values 0 and 1 (see Section 6.6.2). Although not discussed in this text, survey-weighted ordinal logistic regression can be carried out using svyolr().

Example 8.3: Using data from adult participants in the 2019 NSDUH (nsduh2019_rmph.RData), what is the association between lifetime marijuana use (mj_lifetime) and age at first use of alcohol (alc_agefirst), adjusted for age (demog_age_cat6), sex (demog_sex), and income (demog_income)?

Since the question states “among adult participants”, use a domain analysis for this example. Our domain definition is based on one of the variables in the model (demog_age_cat6) and excludes individuals at certain levels of that variable. This does not affect any of the analyses below, but does lead to an issue with tbl_regression() as the excluded levels show up in the table, and show up multiple times. To avoid this problem, create a new version of this variable with missing values at the excluded levels and use fct_drop() to drop the unused levels. This must be done before specifying and subsetting the design so the new variable is available to the design.

load("Data/nsduh2019_rmph.RData")

# Check the levels of the age variable
table(nsduh$demog_age_cat6)

## 
## 12-17 18-25 26-34 35-49 50-64   65+ 
## 13397 14226  8601 11134  4880  3898

nsduh <- nsduh %>%
  mutate(domain = demog_age_cat6 %in%
           c("18-25", "26-34", "35-49", "50-64", "65+"),
         # So tbl_regression works correctly:
         # Set non-adult age group to missing
         age_new = na_if(demog_age_cat6, "12-17"),
         # Drop unused levels
         age_new = fct_drop(age_new))

# Check derivation
table(nsduh$age_new, nsduh$demog_age_cat6, useNA = "ifany")

##        
##         12-17 18-25 26-34 35-49 50-64   65+
##   18-25     0 14226     0     0     0     0
##   26-34     0     0  8601     0     0     0
##   35-49     0     0     0 11134     0     0
##   50-64     0     0     0     0  4880     0
##   65+       0     0     0     0     0  3898
##   <NA>  13397     0     0     0     0     0

Next, specify the survey design and subset the design.

library(survey)
options(survey.lonely.psu = "adjust")
options(survey.adjust.domain.lonely = TRUE)

design.NSDUH <- svydesign(strata=~vestr, id=~verep, weights=~ANALWT_C,
                        nest=TRUE, data=nsduh)

design.NSDUH.adults <- subset(design.NSDUH, domain)

Use svyglm() with family=quasibinomial() to fit the model, and the same summary(), confint(), and car::Anova() calls used for weighted linear regression to view the output.

fit.ex8.3 <- svyglm(mj_lifetime ~ alc_agefirst + 
                      age_new + demog_sex + demog_income,
                    family = quasibinomial(),
                    design = design.NSDUH.adults)
round(
  cbind(
    summary(fit.ex8.3,
              df.resid=degf(fit.ex8.3$survey.design))$coef,
    confint(fit.ex8.3,
              ddf=degf(fit.ex8.3$survey.design))
  )
, 4)

##                               Estimate Std. Error t value Pr(>|t|)   2.5 %  97.5 %
## (Intercept)                     4.8352     0.1433  33.743   0.0000  4.5474  5.1230
## alc_agefirst                   -0.2427     0.0074 -32.711   0.0000 -0.2576 -0.2278
## age_new26-34                    0.2048     0.0456   4.490   0.0000  0.1132  0.2964
## age_new35-49                   -0.1694     0.0429  -3.949   0.0002 -0.2556 -0.0832
## age_new50-64                   -0.1030     0.0516  -1.996   0.0514 -0.2067  0.0007
## age_new65+                     -0.7953     0.0565 -14.078   0.0000 -0.9088 -0.6819
## demog_sexFemale                -0.0672     0.0310  -2.169   0.0348 -0.1294 -0.0050
## demog_income$20,000 - $49,999  -0.1559     0.0362  -4.302   0.0001 -0.2287 -0.0831
## demog_income$50,000 - $74,999  -0.0804     0.0559  -1.438   0.1567 -0.1928  0.0319
## demog_income$75,000 or more    -0.1254     0.0430  -2.919   0.0053 -0.2117 -0.0391

car::Anova(fit.ex8.3, type = 3, test.statistic = "F",
           error.df = degf(fit.ex8.3$survey.design))

## Analysis of Deviance Table (Type III tests)
## 
## Response: mj_lifetime
##              Df       F               Pr(>F)    
## (Intercept)   1 1138.61 < 0.0000000000000002 ***
## alc_agefirst  1 1070.01 < 0.0000000000000002 ***
## age_new       4   92.48 < 0.0000000000000002 ***
## demog_sex     1    4.71              0.03483 *  
## demog_income  3    6.57              0.00079 ***
## Residuals    50                                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Compute the AORs and their 95% confidence intervals.

# Compute odds ratios and their 95% confidence intervals
OR.CI <- cbind("AOR" = exp(   coef(fit.ex8.3)),
                       exp(confint(fit.ex8.3,
                           df.resid=degf(fit.ex8.3$survey.design))))[-1,]
round(OR.CI, 3)

##                                 AOR 2.5 % 97.5 %
## alc_agefirst                  0.784 0.773  0.796
## age_new26-34                  1.227 1.119  1.346
## age_new35-49                  0.844 0.774  0.921
## age_new50-64                  0.902 0.813  1.001
## age_new65+                    0.451 0.403  0.506
## demog_sexFemale               0.935 0.878  0.995
## demog_income$20,000 - $49,999 0.856 0.795  0.921
## demog_income$50,000 - $74,999 0.923 0.824  1.033
## demog_income$75,000 or more   0.882 0.809  0.962

To table these results, use the tbl_regression() function in gtsummary after adding test.statistic = "F" in add_global_p(). As mentioned in Section 8.4.1, there is no option in tbl_regression() to change the DF to make the confidence intervals and p-values for the individual regression coefficients be based on the design DF. See the warning there for how to resolve this issue.

The results are shown in Table 8.6.

library(gtsummary)
t1 <- fit.ex8.3 %>%
  tbl_regression(intercept = T,
                 # Use style_number to round to 2 digits (e.g., 1.27)
                 estimate_fun = function(x) style_number(x, digits = 2),
                 # Use style_sigfig to keep 2 significant digits (e.g., 1.3)
                 # estimate_fun = function(x) style_sigfig(x, digits = 2),
                 label  = list('(Intercept)' ~ "Intercept",
                               alc_agefirst ~ "Age of first alcohol use (years)",
                               age_new ~ "Age (years)",
                               demog_sex      ~ "Sex",
                               demog_income   ~ "Income")) %>%
  modify_column_hide(p.value) %>%
  modify_caption("Weighted logistic regression results for
  lifetime marijuana use vs. age at first alcohol use (years)")

t2 <- fit.ex8.3 %>%
  tbl_regression(intercept = T,
                 exponentiate = T,  # OR = exp(B)
                 # Use style_number to round to 2 digits (e.g., 1.27)
                 estimate_fun = function(x) style_number(x, digits = 2),
                 # Use style_sigfig to keep 2 significant digits (e.g., 1.3)
                 pvalue_fun   = function(x) style_pvalue(x, digits = 3),
                 label  = list('(Intercept)' ~ "Intercept",
                               alc_agefirst ~ "Age of first alcohol use (years)",
                               age_new ~ "Age (years)",
                               demog_sex      ~ "Sex",
                               demog_income   ~ "Income")) %>%
  add_global_p(keep = T, test.statistic = "F"
  # Uncomment to use design df for Type III test p-values
  # (but then they will not match the regression term p-values for
  #  binary predictors)
  #             , error.df = degf(fit.ex8.1$survey.design)
                                 )

TABLE <- tbl_merge(
  tbls = list(t1, t2),
  tab_spanner = c("**Adjusted Coefficients**", "**Adjusted Odds Ratios**")
  ) %>%
  modify_footnote(update = everything() ~ NA, abbreviation = TRUE) %>% 
  modify_footnote(update = list(
    estimate_1 ~ "Intercept = log-odds at reference levels of predictors",
    estimate_2 ~ "e(Intercept) = odds at reference levels of predictors")) %>%
  modify_header(update = list(
    estimate_1 = "**log(AOR)**",
    estimate_2 = "**AOR**"
  ))

TABLE

Table 8.6: Weighted logistic regression results for lifetime marijuana use vs. age at first alcohol use (years)
Characteristic	Adjusted Coefficients		Adjusted Odds Ratios
Characteristic	log(AOR)¹	95% CI	AOR²	95% CI	p-value
Intercept	4.84	4.55, 5.12	125.86	94.24, 168.10	<0.001
Age of first alcohol use (years)	-0.24	-0.26, -0.23	0.78	0.77, 0.80	<0.001
Age (years)					<0.001
18-25	—	—	—	—
26-34	0.20	0.11, 0.30	1.23	1.12, 1.35	<0.001
35-49	-0.17	-0.26, -0.08	0.84	0.77, 0.92	<0.001
50-64	-0.10	-0.21, 0.00	0.90	0.81, 1.00	0.053
65+	-0.80	-0.91, -0.68	0.45	0.40, 0.51	<0.001
Sex					0.036
Male	—	—	—	—
Female	-0.07	-0.13, 0.00	0.94	0.88, 1.00	0.036
Income					<0.001
Less than $20,000	—	—	—	—
$20,000 - $49,999	-0.16	-0.23, -0.08	0.86	0.80, 0.92	<0.001
$50,000 - $74,999	-0.08	-0.19, 0.03	0.92	0.82, 1.03	0.158
$75,000 or more	-0.13	-0.21, -0.04	0.88	0.81, 0.96	0.006
¹ Intercept = log-odds at reference levels of predictors
² e(Intercept) = odds at reference levels of predictors

Conclusion: After adjusting for age, sex, and income, lifetime marijuana use is significantly negatively associated with age at first use of alcohol (AOR = 0.78; 95% CI = 0.77, 0.80; p < .001). Those who start using alcohol one year later have 22% lower odds of lifetime marijuana use.

8.6.2 Prediction

Use svycontrast_design_df() to compute estimated log-odds from a svyglm() object fit using family = quasibinomial() and use ilogit() to convert the log-odds to a probability.

Example 8.3 (continued): What is the estimated probability (and 95% CI) of lifetime marijuana use for someone who first used alcohol at age 13 years, is currently age 30 years, male, and has an income of < $20,000?

# Always include the intercept for estimation
# Specify a 1 for the intercept, a # for each continuous predictor
# and a 1 for each non-reference level of a categorical variable.
# If a predictor is at its reference level, specify a 0 or exclude it.
ilogit(
  svycontrast_design_df(fit.ex8.3,
                        c("(Intercept)"  = 1,
                          "alc_agefirst" = 13,
                          "age_new26-34" = 1))
)

##      est  lower  upper
## 1 0.8682 0.8554 0.8799

Conclusion: The predicted probability of lifetime marijuana use among such individuals is 86.8% (95% CI = 85.5%, 88.0%).

8.6.3 Interactions

Including an interaction in the model and estimating the effect of one variable at levels of another proceeds as described for weighted linear regression above with the addition of exponentiating to obtain AORs.

Example 8.3 (continued): After adjusting for age and income, does the association between lifetime marijuana use and age at first use of alcohol differ by sex? What are the AORs for age at first alcohol use for males and females?

fit.ex8.3.int <- svyglm(mj_lifetime ~ alc_agefirst + 
                          age_new + demog_sex + demog_income + 
                          alc_agefirst:demog_sex,
                        family =quasibinomial(),
                        design = design.NSDUH.adults)

round(
  cbind(
    summary(fit.ex8.3.int,
              df.resid=degf(fit.ex8.3.int$survey.design))$coef,
    confint(fit.ex8.3.int,
              ddf=degf(fit.ex8.3.int$survey.design))
  )
, 4)

##                               Estimate Std. Error t value Pr(>|t|)   2.5 %  97.5 %
## (Intercept)                     4.4769     0.1719  26.048   0.0000  4.1317  4.8222
## alc_agefirst                   -0.2219     0.0096 -23.228   0.0000 -0.2411 -0.2027
## age_new26-34                    0.2063     0.0457   4.511   0.0000  0.1145  0.2982
## age_new35-49                   -0.1658     0.0429  -3.868   0.0003 -0.2519 -0.0797
## age_new50-64                   -0.0941     0.0512  -1.838   0.0720 -0.1970  0.0087
## age_new65+                     -0.7821     0.0565 -13.843   0.0000 -0.8955 -0.6686
## demog_sexFemale                 0.6840     0.3188   2.145   0.0368  0.0436  1.3243
## demog_income$20,000 - $49,999  -0.1586     0.0359  -4.419   0.0001 -0.2308 -0.0865
## demog_income$50,000 - $74,999  -0.0846     0.0564  -1.500   0.1399 -0.1978  0.0287
## demog_income$75,000 or more    -0.1303     0.0431  -3.021   0.0040 -0.2170 -0.0437
## alc_agefirst:demog_sexFemale   -0.0432     0.0181  -2.380   0.0212 -0.0796 -0.0067

rbind(
  "alc_agefirst @ Male" = svycontrast_design_df(fit.ex8.3.int,
      c("alc_agefirst"                 = 1),
      EXP = T),      # Exponentiate to get OR
  "alc_agefirst @ Female" = svycontrast_design_df(fit.ex8.3.int,
      c("alc_agefirst"                 = 1,
        "alc_agefirst:demog_sexFemale" = 1),
      EXP = T)
)

##                          est  lower  upper
## alc_agefirst @ Male   0.8010 0.7858 0.8165
## alc_agefirst @ Female 0.7671 0.7461 0.7887

Conclusion: After adjusting for age and income, the association between lifetime marijuana use and age at first use of alcohol significantly differs between sexes (p = .021, the p-value for the interaction term alc_agefirst:demog_sexFemale). For both males and females, those who start using alcohol at an older age have lower odds of lifetime marijuana use, but this effect is stronger (AOR further from 1) in females (AOR = 0.77) than in males (AOR = 0.80).