5 Analysis and formal results: What do the data show?
This section covers methods of analysis, including basic descriptive summaries and model-based approaches, as well as presentation and basic interpretation of results in textual narratives, tables, and figures. By formal results, we mean qualitative and quantitative values and their mathematical interpretations. The next section addresses interpretation of results within the context of public health.
5.1 Describe methods, including software
Principles:
- A report should give enough information about data-analytic methods—including purpose, name, and software implementation—that a knowledgeable reader can confidently trace workflow from data sources through analytic methods to formal results. (Bailar and Mosteller 1988)
- Assumptions behind statistical methods should be provided in sufficient detail to justify, and if possible to replicate, the analysis. Consequences of assumptions should be acknowledged.
- Specify software used for analysis, including the main program(s) and important specialized packages, libraries, or modules. (Compare Bailar and Mosteller 1988.)
Observations:
- In some cases, a report describes its analytic purpose without directly naming the method. For example, mm6802a1 (García et al. 2019) describes the STL procedure without naming it: “smooth temporal trends were statistically separated from annual seasonal components using locally weighted regression”. This example predates the MMWR practice of reporting analytic software.
- Analytic software (e.g., SAS, R, Python, Stata, SUDAAN) is consistently reported since early 2019. In some cases, data management software (e.g., REDCap [mm6920e2 (James et al. 2020), mm6936a5 (Fisher, Tenforde, et al. 2020), mm7021e1 (Gettings et al. 2021), mm705152a3 (Wanga et al. 2021)]) and analytic environments (e.g., Jupyter Notebooks [mm6924e1 (Czeisler, Tynan, et al. 2020)], RStudio [mm7121e1 (Bull-Otterson et al. 2022)]) are also reported.
Recommendations:
- For each substantive method, a report should state an unambiguous name or its general purpose; if space permits, report both. Recommended wording appears with specific examples below.
- Guidance should clarify which software to report, namely, software that is directly relevant to analytic methods: main analytic software and important procedures, packages, or libraries (e.g., GLIMMIX in SAS, randomForest in R, sklearn in Python). Reports need not name computing environments (e.g., Jupyter or RStudio) unless these substantively affect analysis.
5.2 Describe quantities, including outcomes
This section covers summary quantities that can be calculated without a regression (or similar) model. Examples include mean, proportion, rate, ratio, variance, range, and standard error.
Principles:
- Descriptive quantities are properly named.
- Descriptive quantities are formatted with sufficient precision—the number of significant digits and number of decimal places—to offer meaningful distinctions but no more than supported by the resolution of the data. Format precision is consistent across quantities of similar type. (New England Journal of Medicine preparation instructions)
Observations:
- In some cases, the word “rate” is used for a proportion. For example, mm6817a3 (Kariisa et al. 2019) reports the number of deaths per 100,000 population, standardized to the 2000 Census population, as “death rates”; the authors might be implicitly treating the unit as 100,000 person-years. Report mm7037e1 (Scobie et al. 2021) does the same. In contrast, report mm7043e2 (Xu et al. 2021) presents standardized mortality rates as the number of deaths per 100 person-years.
- Some reports use as many as 9 significant digits, typically for count values in large datasets or populations (e.g., mm6802a1 (García et al. 2019), mm6911a5 (Schieber et al. 2020), mm7104e1 (León et al. 2022)). Report mm7121a2 (Sapkota et al. 2022) formats mean weekly counts and corresponding CIs with up to 7 significant digits.
- No reviewed reports regularly format numeric values with 3 or more decimal places (except for P-values, discussed below).
- Some reports have large variation in the number of significant digits, in part because of rounding practices. For example, the table in mm7023e2 (Christie et al. 2021) has a column with values ranging from 0.2 (1 significant digit, 1 decimal place) to 9,008 (4 significant digits, 0 decimal places).
Recommendations:
- Ensure consistent use of the terms proportion, rate, and ratio, each of which pertains to the result of dividing one number (the numerator) by another (the denominator).
Proportion: The numerator and denominator have the same units, both are typically nonnegative integers, and the numerator is typically smaller than the denominator. A proportion is unitless and typically expressed as a fraction or percentage.
Rate: The numerator and denominator have different units, both are typically nonnegative, the numerator is typically smaller than the denominator, and the denominator unit typically incorporates time or space. A rate is not unitless, e.g., X incident cases per person-year.
Prevalence is typically a proportion, so “prevalence rate” would not make sense. When incidence is a proportion, “cumulative incidence” is a better descriptor. As a rate (e.g., infections per person per year), either “incidence” or “incidence rate” is acceptable.
Ratio: See below.
- Develop a house style that considers both the number of decimal places when rounding and the number of significant digits. (A formatting sketch appears after this list.)
- The number of significant digits is mathematically of deeper importance and should therefore be the primary consideration. In general, we recommend preserving 2-4 significant digits throughout a report. Exceptions might include large, precisely quantified sample and subsample sizes and some official statistics. Derived values should generally be reported with 4 or fewer significant digits.
- The number of decimal places, typically 0-2, should be chosen in a way that generally preserves the intended number of significant digits. Where reported values, especially derived values, vary by 2 or more orders of magnitude, some smaller values should be formatted with additional decimal places. Terminal zeroes should be preserved when consistent with the number of significant digits and number of decimal places.
- Recommendations for formatting ratios, model-based estimates, P-values, and confidence intervals appear below.
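To make such a house style concrete, here is a minimal sketch of a significant-digits-first formatter; the function name format_sig and the default of 3 significant digits are ours, for illustration only.

```python
import math

def format_sig(x: float, sig: int = 3) -> str:
    """Round x to `sig` significant digits, then display it with however
    many decimal places that rounding implies (0 for large values)."""
    if x == 0:
        return "0"
    # Decimal places implied by the significant-digit target; negative
    # for large values, which are then rounded to tens, hundreds, etc.
    dp = sig - 1 - math.floor(math.log10(abs(x)))
    return f"{round(x, dp):,.{max(dp, 0)}f}"

# 9008 -> '9,010'; 123.456 -> '123'; 0.2468 -> '0.247'; 0.0021 -> '0.00210'
# Note the preserved terminal zero in the last case.
for v in (9008, 123.456, 0.2468, 0.0021):
    print(format_sig(v))
```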
5.3 Compare quantities: differences and ratios
Principles: When a set of quantities is to be compared, it is best to contrast them directly, usually as a difference or a ratio, and to assess the contrast. In practice, the difference or ratio can be constructed either empirically, such as by taking the difference of 2 means or the ratio of 2 prevalence estimates, or in a model, such as the difference between 2 coefficients in a regression model. Formal inference procedures, if any, are then applied to the contrast.
Observations:
- Report mm7023e2 (Christie et al. 2021) is an example with at least 4 kinds of contrast in a single table: rate ratios comparing rates between age groups (with confidence intervals) and absolute change in proportion, relative change in rate, and relative change in rate ratio—3 comparisons between time points, with binary significance test results but not CIs.
- Report mm6817a3 (Kariisa et al. 2019) presents absolute and percent changes in rate between time points, with binary significance test results but not CIs.
- Reports mm6827a2 (Su et al. 2019) and mm7039e3 (Budzyn et al. 2021) present measures of primary effects and statistical tests for whether those primary effects differ (beyond statistical variation), but neither report directly quantifies the actual contrasts that are being tested.
- Report mm7041a2 (Bohm et al. 2021) presents model-based, age-standardized prevalence, median, 75th centile, and 90th centile, each accompanied by a 95% CI. Some comparisons between values, such as the prevalence, frequency, and intensity by sex, are described as statistically significant, but the numerical differences between values are not presented, with or without CIs or a measure of the variability of the difference.
- Report mm7121e1 (Bull-Otterson et al. 2022) presents cumulative incidence and point incidence as event rates per 100 person-months. Cases and controls are compared in terms of differences in cumulative incidence (described in the report as absolute risk difference) and incidence rate ratios (RRs). These contrasts are only informally compared between 2 age groups. For example, the report states, “The RR [incidence rate ratio] for cardiac dysrhythmia was significantly higher among patients aged 18–64 years (RR = 1.7) compared with those aged ≥65 years (1.5).” It does not directly quantify the contrast between these ratios, nor does it quantify the variability in the ratio of ratios.
- Report mm7047e1 (DeSisto et al. 2021) analyzes the risk for stillbirth among women with and without COVID-19 at delivery hospitalization, overall and among subsets of mothers with various cooccurring conditions. Adjusted risk ratios (aRRs) for stillbirth are presented in each of 2 time periods (before and during the period when the Delta variant predominated), and interaction terms in regression models support formal comparison and significance testing between the aRRs for the 2 periods. The report does not, however, quantify or interpret the interactions. For example, the report states, “During the pre-Delta period [stillbirths involved] 0.98% of deliveries with COVID-19 compared with 0.64% of deliveries without COVID-19 (aRR = 1.47; 95% CI = 1.27–1.71). During the Delta period [stillbirths involved] 2.70% of deliveries with COVID-19 compared with 0.63% of deliveries without COVID-19 (aRR = 4.04; 95% CI = 3.28–4.97). … the risk for stillbirth was significantly higher during the period of Delta predominance than during the pre-Delta period (p<0.001).” The reader is not told, however, that the relative risk was 2.75 times as high during the later period as during the earlier period, much less given a CI for the 2.75 interaction term (a sketch quantifying this contrast appears after this list). Nor are any other before-during contrasts formally quantified.
- Report mm7037e1 (Scobie et al. 2021) presents age-standardized incidence rate ratios (IRRs) together with 95% CIs, and it qualitatively compares pairs of IRRs without either a formal significance test or CI. For example, “Age-standardized IRRs for cases in persons not fully vaccinated versus fully vaccinated decreased from 11.1 (95% CI = 7.8–15.8) during [period 1] to 4.6 (95% CI = 2.5–8.5) during [period 2] ….” Although the reader is given a sense of the variability in each individual IRR estimate, the contrast between the paired IRRs is neither quantified nor its variability described.
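To illustrate how the missing before-versus-during contrast in mm7047e1 could be quantified from the published estimates alone, the sketch below recovers each log-scale standard error from the reported CI limits and assumes the 2 periods are independent. It is a back-of-the-envelope approximation, not the authors’ analysis.

```python
import math

def se_from_ci(lcl: float, ucl: float, z: float = 1.96) -> float:
    """Approximate the SE of log(aRR) from reported 95% CI limits."""
    return (math.log(ucl) - math.log(lcl)) / (2 * z)

# Published aRRs for stillbirth (DeSisto et al. 2021):
rr_pre,   ci_pre   = 1.47, (1.27, 1.71)    # pre-Delta period
rr_delta, ci_delta = 4.04, (3.28, 4.97)    # Delta period

log_ratio = math.log(rr_delta / rr_pre)
se = math.hypot(se_from_ci(*ci_pre), se_from_ci(*ci_delta))

ratio = math.exp(log_ratio)
lcl = math.exp(log_ratio - 1.96 * se)
ucl = math.exp(log_ratio + 1.96 * se)
print(f"ratio of aRRs = {ratio:.2f} (95% CI {lcl:.2f}-{ucl:.2f})")
# -> roughly 2.75, with a 95% CI of about 2.1 to 3.5
```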
Recommendations:
- Ensure that differences, such as the (absolute) risk difference, are labeled, described, and interpreted correctly.
- Ensure that ratios are labeled, described, and interpreted correctly. (A worked 2×2 sketch appears after this list.)
Ratio: one number divided by another, typically capturing a relative measure, such as a prevalence ratio, odds ratio, or hazard ratio.
Both a prevalence ratio and an odds ratio can be interpreted qualitatively as indicating that one event is more or less likely than another event. Only a prevalence ratio can be interpreted quantitatively, as in “Event X is 1.2 times as likely as event Y.” An odds ratio must be interpreted as “Event X has 1.2 times the odds of event Y.”
With odds ratios, it is acceptable to provide a qualitative interpretation of “more likely than” (if the OR > 1) or “less likely than” (if the OR < 1). It is incorrect, however, to provide a quantitative interpretation of an OR using the word “likely”. For example, an OR of 1.3 does not mean that some event is 30% more likely than the contrasting event; rather, it means that the event has 30% greater odds. In limited circumstances, especially with rare events, an OR may be interpreted as approximating a prevalence ratio. When this is done, the approximating interpretation should be explicitly acknowledged. For example: “Because cancer X is rare, we may interpret the OR of 1.04 from our case-control study as indicating that event A occurs about 4% more often [or was about 4% more likely] than event B.”
- Where analysis compares quantities, it is best to present differences or ratios directly, even if the components being compared are also presented. The practical significance of a result typically pertains to these quantified contrasts more so than to the values being compared.
- Adopt terminology that is unambiguous. The terms “risk ratio” and “relative risk” can be used for several kinds of ratios. A more specific term, such as “prevalence ratio” or “hazard ratio”, should be used where it aids clarity.
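The OR-versus-PR distinction is easy to demonstrate with a hypothetical 2×2 table; the counts below are invented for illustration.

```python
# Hypothetical 2x2 table: rows are exposure, columns are event status.
a, b = 30, 70    # exposed:   30 events, 70 non-events
c, d = 20, 80    # unexposed: 20 events, 80 non-events

risk_exposed   = a / (a + b)                      # 0.30
risk_unexposed = c / (c + d)                      # 0.20

prevalence_ratio = risk_exposed / risk_unexposed  # 1.50
odds_ratio = (a / b) / (c / d)                    # ~1.71

print(f"PR = {prevalence_ratio:.2f}")  # the event is 1.5 times as likely
print(f"OR = {odds_ratio:.2f}")        # the exposed have 1.71 times the odds
# Because the event is common here (20-30%), reading the OR of 1.71 as
# "71% more likely" would overstate the contrast; the PR of 1.50 is the
# quantity that supports a "times as likely" statement.
```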
5.4 Adjust quantities: standardization and regression models
Principles:
- Standardization methods should indicate, and ideally justify, the reference population or distribution used for standardization. In 1998, the Department of Health and Human Services issued a policy (HHS policy for changing the population standard for age adjusting death rates 1998) directing that “the population standard used for age adjusting death rates [be] the year 2000 projected population.”
- Each type of regression model should be sufficiently described and justified. Details might include model form and family (e.g., linear, logistic, loglinear; Gaussian, Poisson, negative binomial), important variations in curve types (e.g., linear, quadratic, joinpoint, segmented, and fractional polynomials), extensions to handle correlation (e.g., generalized estimating equations [GEE] or mixed-effects models, along with their clustering factors and correlation or covariance structures), or other salient features and underlying assumptions.
Observations:
- Many reports in this review standardize mortality and other measures to the age distribution of the 2000 Census (e.g., mm7034e5 (Griffin et al. 2021), mm7037e1 (Scobie et al. 2021), mm7041a2); this is direct standardization to an external standard. In contrast, report mm6911a5 (Schieber et al. 2020) standardized to each of 11 specific years (2008-2018), and report mm7043e2 (Xu et al. 2021) standardized to an internal distribution, i.e., direct standardization with an internal standard. (A minimal sketch of direct standardization follows these observations.)
- A large proportion of the reports reviewed here present regression models. Reports generally describe adequately which values are standardized (as in the previous observation) or adjusted through a regression model, and they do so as well when both unadjusted and adjusted values are presented. Model-based results typically include only elements of focus (the outcome and a few predictors or covariates), commenting on ancillary covariates but not tabulating them; this is consistent with the preferred practice of selectively repeating only a subset of tabulated values in narrative text.
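For concreteness, a minimal sketch of direct standardization follows; the age groups, counts, and standard weights are hypothetical and do not reproduce the 2000 Census standard.

```python
# Direct standardization: weight the study population's age-specific rates
# by a standard population's age distribution. All numbers are hypothetical.
age_groups = ["0-24", "25-44", "45-64", "65+"]
deaths     = [10, 40, 90, 160]                 # study deaths by age group
person_yrs = [50_000, 40_000, 30_000, 20_000]  # study person-years at risk
std_weight = [0.35, 0.30, 0.22, 0.13]          # standard age distribution (sums to 1)

crude = 1e5 * sum(deaths) / sum(person_yrs)
standardized = 1e5 * sum(w * d / py for w, d, py
                         in zip(std_weight, deaths, person_yrs))

print(f"crude rate:            {crude:.0f} per 100,000 person-years")
print(f"age-standardized rate: {standardized:.0f} per 100,000 person-years")
```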
Recommendations:
- When standardizing ratios or rates, the reference population should be stated and justified explicitly.
- Describe how continuous predictors are included, including functional form (e.g., continuous, transformed, or quantized values) and methods for assessing linearity and nonlinearity.
- Describe efforts to select models through iteratively adding or removing predictor terms, and describe the implications for interpreting results.
- Important model parameters should be interpreted correctly and in context. Focus on the main association of interest. Scale model components to aid in interpretation (e.g., age in decades rather than years).
- Develop basic guidance on how to present and interpret common or recurring procedures, such as generalized estimating equations (GEE), joinpoint, and interrupted time series.
- Annual percent (not percentage) change (APC) and its variants, such as the average annual percent change (AAPC), can be calculated empirically or from a loglinear regression model. Therefore, when an APC is reported, the method for calculating it should also be stated. (A sketch estimating an APC from a loglinear model appears after this list.)
- Terminology:
- Instead of “crude”, use “unadjusted” or “empirical”; these terms correspond to “adjusted” and “model-based”, respectively.
- Instead of “modified Poisson regression”, use “Poisson regression with robust variance [or standard error] estimates”.
- Reserve “dose-response” for situations where an intervention is administered in doses. Alternatives include “exposure-response” (when appropriate) and “functional association [or relationship]” where the notion of dose, exposure, or response does not apply.
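As a sketch of 2 recommendations above—preferring “Poisson regression with robust variance estimates” and stating how an APC was calculated—the following fragment fits a loglinear model to hypothetical yearly counts with statsmodels and derives the APC as 100 × (exp(β) − 1). The data are invented.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical yearly case counts with a constant population as exposure.
years = np.arange(2013, 2023)
cases = np.array([120, 131, 140, 155, 171, 180, 198, 214, 230, 251])
pop = np.full(years.shape, 1_000_000)

# Poisson (loglinear) regression of counts on calendar year, with robust
# ("sandwich") standard errors -- i.e., Poisson regression with robust
# variance estimates.
X = sm.add_constant(years - years[0])
fit = sm.GLM(cases, X, family=sm.families.Poisson(),
             exposure=pop).fit(cov_type="HC0")

# APC from the loglinear slope: APC = 100 * (exp(beta) - 1).
beta = fit.params[1]
apc = 100 * np.expm1(beta)
lo, hi = 100 * np.expm1(fit.conf_int()[1])
print(f"APC = {apc:.1f}% per year (95% CI {lo:.1f}% to {hi:.1f}%)")
```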
5.5 Draw conclusions under incomplete information
This section covers the use of statistical significance tests, P-values, and confidence intervals. These are conventional frequentist procedures for drawing inferences from statistical analysis, typically relative to a null hypothesis. We do not cover Bayesian inference here, as Bayesian methods remain rare in MMWR full reports.
Principles: The ASA and ICMJE, as well as other professional organizations and numerous journals, have commented extensively on the use and interpretation of P-values and confidence intervals. In current practice, ASA and ICMJE recommend the following:
- “Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold. … Proper inference requires full reporting and transparency.” (Wasserstein et al. 2016)
- “Distinguish between clinical and statistical significance.” (ICMJE recommendations 2025)
- “Avoid sole reliance on statistical hypothesis testing, such as the use of P-values, which fails to convey important quantitative information.” (Bailar and Mosteller 1988)
- “In general, P-values larger than 0.01 should be reported to 2 decimal places; those between 0.01 and 0.001 to 3 decimal places; and P-values smaller than 0.001 should be reported as P<0.001.” (New England Journal of Medicine preparation instructions) Exceptions exist, such as when additional precision is intended to support multiple-comparison adjustments.
- Appeals to nonoverlapping CIs “should not be used for formal significance testing unless the data analyst is aware of its deficiencies and unless the information needed to carry out a more appropriate procedure is unavailable.” (Schenker 2001) This heuristic is an unnecessary and weak shortcut that overemphasizes a binary significance result, against prevailing guidance from professional societies and journals. Thus, the method should be used only in the rare circumstances when there is no feasible alternative; when it is used, it should be strongly justified.
Observations:
- The full reports in this review consistently state the intended level for statistical significance (almost always 0.05) and coverage probability for confidence intervals (almost always 95%).
- Reports usually state in the text the specific methods for constructing significance tests and confidence intervals, or they give information from which a reader can determine how inferential procedures were applied.
- In contrast, tables and figures do not consistently state the specific methods for constructing significance tests and confidence intervals. See the subsequent discussion of tables and figures for specific examples.
- When the slope of a line segment is not statistically significantly different from 0, it might or might not be appropriate to interpret the segment as indicating no appreciable change over time. Reports mm6817a3 (Kariisa et al. 2019) and mm6844a1 (O’Neil et al. 2019) use the word “stable” in this context. For example, “Joinpoint regression examining changes in trends [in rates] were considered to increase if APC >0 (p<0.05) and to decrease if APC <0 (p<0.05); otherwise rates were considered stable [emphasis added].” This conclusion ignores that a series can exhibit noisy or nonlinear variation and still yield P > 0.05.
- The table in report mm6827a2 (Su et al. 2019) contains cell values marked “NS”, with no further quantitative content, described in the table footnote as “not significantly different from reference group”. Other cells in the same column contain point estimates with 95% confidence intervals.
- The table in report mm6911a5 (Schieber et al. 2020) contains 95% confidence intervals, together with footnote marks indicating “Pearson’s chi-squared test was significant (p<0.001) compared with [a specified referent]”.
- Tables 2 and 3 in report mm6924e1 (Czeisler, Tynan, et al. 2020) report P-values as either “<0.05” with footnote “P-value is statistically significant (p<0.05)” or as a number greater than 0.05 formatted to 4 decimal places.
- Report mm7034e5 (Griffin et al. 2021) incorrectly interprets the Kruskal-Wallis test: “Differences in the percentages of infections by vaccination status were calculated using … Kruskal-Wallis tests for medians. [emphasis added]” Kruskal-Wallis does not, in general, compare medians, unless the shapes of the respective distributions under comparison are also the same. Instead, the Kruskal-Wallis test compares whether one distribution tends to take on lesser or greater values than another.
- Table 1 in report mm705152a2 (Lutrick et al. 2021) formats all P-values to 3 decimal places, ranging from 0.002 to 0.941, with a single P-value formatted as “>0.999”.
- The table in report mm7121a2 (Sapkota et al. 2022) formats mean weekly counts of emergency department visits and their 95% CIs at full precision, with 4-7 significant digits and 0 decimal places. Mean counts range from 1,413 to 1,451,717, and CI limits range from 1,356 to 1,463,581.
- Reports mm6817a3 (Kariisa et al. 2019), mm7041a2 (Bohm et al. 2021), and mm7121e1 (Bull-Otterson et al. 2022) all appeal to nonoverlapping CIs.
- Several reports include dozens or hundreds of simultaneous confidence intervals or significance tests.
Recommendations: Since the criticized practices are deeply entrenched, these recommendations focus primarily on placing inferential procedures in context and emphasizing other aspects of drawing conclusions and reporting them. The following recommendations pertain to both significance tests and CIs.
- Follow ICMJE and ASA guidance on use of inferential procedures:
- Avoid emphasizing binary conclusions.
- Prefer confidence intervals or other interval estimates over P-values.
- Don’t conclude absence of an effect merely because a result is not “significant” (e.g., “stable” for a slope not significantly different from 0), unless the conclusion also appeals to adequate statistical power or other justification.
- It is acceptable to report both a CI and a P-value. If only 1 is reported, favor the CI. Not every method admits an interpretable CI, such as the Fisher and Pearson tests for contingency tables.
- When reporting both a CI and a significance test result, it is acceptable to state whether or not the test is statistically significant at a specified level. It is better to report both the CI and P-value.
- Ensure that the interpretation of a P-value or CI is correct. Both the P-value and the CI are calculated from a specific dataset and can therefore be interpreted as characterizing results from the data relative to a hypothesis—whether the observed data are compatible with that hypothesis—not whether a given hypothesis is supported, proven, or disproven.
- In general practice, a negative significance test result may not be interpreted as indicating the absence of a meaningful effect. For example, a slope parameter that is not statistically significantly different from 0 may not be interpreted as indicating that the quantity is constant, flat, or stable, unless additional assumptions or analyses are presented to support the stronger claim.
- A test or a CI for a comparison should be applied directly to the comparison itself, such as a difference or ratio.
- Appeals to overlapping or nonoverlapping CIs should be reserved for exceedingly rare cases where they are the only option. (A sketch of the heuristic’s weakness appears after this list.)
- When a report includes a large number of significance test results or CIs, address the potential for spurious significance arising from multiplicity (related to what is sometimes called P-hacking). As with other inferential procedures, multiple simultaneous procedures should be interpreted both for their public health significance and in the context of statistical variability and potential bias.
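A small numerical sketch, with invented estimates, shows why the overlap heuristic is weak: 2 independent 95% CIs can overlap even though a direct test of the difference rejects at the 0.05 level.

```python
import math
from scipy import stats

# Two independent estimates (hypothetical means and standard errors).
m1, se1 = 10.0, 1.0    # 95% CI roughly (8.0, 12.0)
m2, se2 = 13.0, 1.0    # 95% CI roughly (11.0, 15.0); the CIs overlap

# Test the comparison directly, as recommended above.
se_diff = math.hypot(se1, se2)        # SE of the difference, ~1.41
z = (m2 - m1) / se_diff               # ~2.12
p = 2 * stats.norm.sf(abs(z))         # ~0.034
print(f"difference = {m2 - m1:.1f}, z = {z:.2f}, P = {p:.3f}")
# The individual CIs overlap between ~11.0 and ~12.0, yet the direct test
# is significant at 0.05: overlap is a conservative, unreliable shortcut.
```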
Recommendations specific to significance testing: The following recommendations pertain to significance testing and P-values.
- Mention the testing procedure by its correct, unambiguous name on the first occurrence of a P-value from that procedure.
- Since there are several chi-squared tests, ensure that tests for independence or homogeneity specify “Pearson” or “Pearson’s”, as in “Pearson’s chi-squared test” (both of which are preferable to “Pearson’s X2 [or 𝜒2] test”).
- A P-value should typically be reported to 2 significant digits unless its value is less than 0.001. Specifically, for P > 0.10, report no more than 2 decimal places, and for P strictly between 0.001 and 0.10, report no more than 3 decimal places unless a result is meant to support adjustment for simultaneous inferences. (A formatting sketch appears after this list.)
- Never report only “P < 0.05” for a specific result in the text.
- Never report merely that a result was “nonsignificant” (nor “NS”); always report a P-value or the intended test size.
- If a result is described as statistically significant or nonsignificant, then state the size of the test, also known as the type 1 error probability or “alpha”. In MMWR reports, 0.05 is the most common. For example, “X (P = 0.03) is statistically significantly different from 0 at the 0.05 level.”
- Any test that is not 2-sided or 2-tailed should be explicitly described and justified. It is conventional to halve the significance level for 1-tailed tests (e.g., 0.025 when the 2-tailed test would be at 0.05 significance).
- Never describe a test result whose P-value slightly exceeds the type 1 error level as a “trend” or as “approaching significance”. Reserve “trend” for actual tests of trend (e.g., Cochran-Armitage or the slope of a regression). (No reports in our review used the word “trend” as a hedge on statistical significance.)
- When comparing continuous distributions, one may use the nonparametric Wilcoxon (or Mann-Whitney) and Kruskal-Wallis methods with normal or nonnormal data. It is incorrect to describe these tests as comparing medians without further assumptions.
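A minimal formatter implementing the P-value rounding recommendation above; the function name format_p is ours, and house style would need to settle edge cases such as values that round to 1.

```python
def format_p(p: float) -> str:
    """Format a P-value per the recommendation above: below 0.001 as
    'P<0.001'; 0.001-0.10 with 3 decimal places; above 0.10 with 2."""
    if p < 0.001:
        return "P<0.001"
    if p <= 0.10:
        return f"P={p:.3f}"
    return f"P={p:.2f}"

# 0.0004 -> 'P<0.001'; 0.0042 -> 'P=0.004'; 0.034 -> 'P=0.034'; 0.47 -> 'P=0.47'
for p in (0.0004, 0.0042, 0.034, 0.47):
    print(format_p(p))
```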
Recommendations specific to confidence intervals: The following recommendations pertain to CIs.
- Mention the CI construction procedure by name on the first occurrence of a CI from that procedure.
- Format CI endpoints with the same precision as the corresponding point estimate. MMWR house style places the lower and upper confidence limits in parentheses, delimited by an en dash when both values are nonnegative, as “(1.5–2.5)”, or by the word “to” when the lower limit is negative, as “(−2.5 to −1.5)”. All CIs in a table should use the same format (see an example of the latter in report mm6911a5 (Schieber et al. 2020)). A formatting sketch appears after this list.
- State the confidence level (i.e., intended coverage probability). In MMWR reports, 95% coverage is the most common.
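A minimal sketch of the house CI format; format_ci is an illustrative name, and a production version would also emit the true minus sign (U+2212) shown in the style example rather than the ASCII hyphen-minus that Python produces by default.

```python
def format_ci(lcl: float, ucl: float, dp: int = 1) -> str:
    """Format a CI per the house style above: en dash between nonnegative
    limits, the word 'to' when the lower limit is negative."""
    sep = "\u2013" if lcl >= 0 else " to "   # \u2013 is an en dash
    return f"({lcl:.{dp}f}{sep}{ucl:.{dp}f})"

print(format_ci(1.5, 2.5))     # (1.5–2.5)
print(format_ci(-2.5, -1.5))   # (-2.5 to -1.5)
```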
5.6 Tables
Principles:
- Tables in a main report should balance density against utility.
- Data and analytic concepts are described with enough detail that the table can stand apart from its report. Analytic and inferential procedures are correctly named, often in a table’s footnote.
- Within a table, a reader should be able to cross-reference cells that are compared, e.g., with row or column percentages.
Observations:
- Several tables densely format a large number of numeric values or footnote marks.
- Table 2 in report mm7041a2 (Bohm et al. 2021) contains over 1,500 numbers in 528 cells (88 rows and 6 columns of point estimates with 95% CIs). Table 2 in report mm7104e1 (León et al. 2022) contains 930 numbers (62 rows and 5 columns of point estimates with 95% CIs).
- The table in report mm6817a3 (Kariisa et al. 2019) contains 250 cells (among 66 rows and 4 columns) to which a significance test is applied, of which 184 have a footnote mark “†††”. Table 2 in report mm6932a1 (Czeisler, Lane, et al. 2020) contains 112 cells to which a significance test is applied, of which 74 have a footnote mark “**”; it and the table in report mm7023e2 (Christie et al. 2021) thus contain a large number of cells with footnote marks.
- The table in report mm6844a1 (O’Neil et al. 2019), table 2 in report mm6903a1 (Peterson et al. 2020), and the table in report mm6911a5 (Schieber et al. 2020) present both a large number of table cells and a large number of cells with footnote marks.
- Tables rarely specify the name of significance tests, with a few notable exceptions.
- Table 2 in report mm6802a1 (García et al. 2019) indicates “multiplicity-adjusted Wald tests” and annotates some 95% CIs as “p < 0.05”, “p < 0.01”, or “p < 0.001”.
- The table in report mm6817a3 (Kariisa et al. 2019) reports p “< 0.05” based on nonoverlapping CIs. (See separate comment about nonoverlapping CIs.)
- The table in report mm6911a5 (Schieber et al. 2020) annotates 95% CIs for which “Pearson’s chi-squared test was significant (p<0.001) compared with [referent]”. The same table also annotates when the average annual percentage change (AAPC) “was significantly different from zero at the alpha = 0.05 level”, but the footnote does not indicate how AAPC point estimates or CI limits were calculated.
- The footnotes to tables 2 and 3 in report mm6924e1 (Czeisler, Tynan, et al. 2020) indicate that reported P-values were “calculated with Chi-squared test of independence”.
- The footnote to table 1 in mm6930e1 (Tenforde et al. 2020) states that P-values are based on comparisons “using the chi-square test or Fisher’s exact test.”
- No tables fully describe methods for CIs, though many partially describe regression procedures.
- Among 17 tables that report significance test results but not CIs, 2 report only dichotomous results, 2 (in report mm6924e1 (Czeisler, Tynan, et al. 2020)) report 4 decimal places if P>0.05 but “<0.05” otherwise, and 13 report P with 2-3 digits. Among these 17 tables, 12 state the test method and 5 do not.
- Among 21 tables that report CIs but not significance tests, 5 report the CI method, and 16 do not.
- Among 21 tables that report both significance tests and CIs, 5 report P with 2-3 digits, 14 report dichotomous test results with stated significance level, 1 (mm6844a1 (O’Neil et al. 2019)) reports dichotomous results without a stated significance level, and 1 (mm6802a1 (García et al. 2019)) reports a mix of quantified P-values and dichotomous test results with stated significance. The CI methods are stated for 1 table out of 21: mm7023e2 (Christie et al. 2021), which indicates a parametric bootstrap, but not which parametric model.
- Reports mm6802a1 (García et al. 2019), mm6844a1 (O’Neil et al. 2019), and mm6911a5 (Schieber et al. 2020) include tables that present APCs without indicating the source or method for the APC estimates. The table in report mm6906a3 (Divers et al. 2020) partially states the APC method.
Recommendations: These recommendations pertain to tables as a whole. See the recommendation specific to numerical format elsewhere in this report.
- Balance table density against utility.
- A sparse table might work better as prose.
- A dense, large, or complex table should be thinned out, simplified, or otherwise reduced in complexity, so that the reader knows what to focus on and how to connect the results in the table with the overall narrative, while preserving the transparency and integrity of the analysis.
- In some cases of complex data, a figure might be more suitable than a table.
- Where the full extent of a dense, large, or complex table is important—for example, as comprehensive documentation or as an official record—place the table in a supplement and alert the reader to the value and location of the more extensive tabulation.
- Always name inferential procedures (how Ps and CIs were obtained), typically in a footnote.
- It is preferred to state whether inferences are adjusted for multiple comparisons or simultaneous inferences; if adjustment is stated, then the method must be named (e.g., Bonferroni or Holm).
- Where contrasts are presented, explicitly note the reference value or category for the contrast. These often appear in table cells for categorical constructs with more than 2 values. For binary or dichotomous constructs, the referent may be stated in a label or a table footnote rather than its own table cell. For example, where 2 genders are analyzed and a contrast value is labeled as “female”, it should be noted that the contrast is against “male”.
5.7 Figures
Principles:
- Figures follow modern visualization practices for statistical graphics, which include “Maximize the data-ink ratio” and “Forgo chartjunk.” (Tufte 2018)
- Bar charts, dot plots, and other graphic elements, with or without error bars, are used based in part on empirical research regarding their effectiveness. (Cleveland 1984; Correll and Gleicher 2014; Irizarry 2019; Kerns and Wilmer 2021; Pentoney and Berger 2016)
- Axis scales and superposed elements are principled. (Ellis 2016; Few 2008; Healy 2016; Knaflic 2016; Muth 2018; Vigen 2024)
- As with other figure elements, data and analytic concepts are described with enough detail that the figure can stand apart from its report. In particular, analytic and inferential procedures are correctly named, often in a figure’s footnote.
Observations: This section summarizes salient characteristics of figures in the reviewed reports. Appendix 1 contains detailed commentary on all 63 figures in the 41 reports that contain figures.
- 6 figures use dual-scale axes, meaning that the scale on one side differs from the scale on the other.
- 10 figures use broken axes, meaning that an axis has a visible gap meant to stand in for an even larger gap, reducing the total extent the axis would otherwise require. In most cases, the breaks allow the range of percentages to include 100, although not all figures with percentage scales include 100, and some other graphic types include breaks.
- 2 figures are bar charts without intervals.
- 3 figures are bar charts with intervals, also known as dynamite charts.
- 14 figures exemplify issues of scale, including the extent of the axis relative to the extent of depicted data, the overall use of space, and the use of transformations like the logarithm.
- 14 figures exemplify the use of visual reference features that do not directly depict data elements, including frames around graphic panels and fields as well as grids and other guides.
- 6 figures use, or could use, scatterplot smooths.
- 10 figures depict several intervals, either around dots or around points on a line graph. In most cases, a single interval around a dot need not have end caps (see, e.g., mm7018e1 (Tenforde et al. 2021)). Where the intervals appear at a sequence of locations on a line graph, the set of intervals could be replaced by shading a polygon that covers the extent from the lower limits of the intervals (connected dot to dot) to the upper limits. The line graph then appears in the interior of the pointwise confidence region. Overlap between shaded polygons can appear as a blended shade.
- 7 figures contain partial information on estimation or inferential procedures.
Recommendations:
- Maximize the data-ink ratio and forgo chartjunk.
- Render panels with 1 or 2 axes rather than 4-sided frames.
- Use bar charts sparingly and correctly. (See further details below.)
- Avoid using grids. If grids or other visual cues are used, such as a vertical line for visual reference, then deemphasize these elements relative to data elements. Typical methods include using thinner lines or lighter shades. Ensure that these elements appear in the background, with data elements in the foreground.
- Allow direct labeling of graphic components as an alternative to a legend.
- Adapt the vertical and horizontal extents of a graphic to the scale of the data. (See further details below.)
- Always name inferential procedures (how Ps and CIs were obtained), typically in a footnote.
- Favor dot plots over bar charts.
- Reserve bar charts for counts or proportions, such as histograms and counts of incident events by date.
- Where a graphic element, especially a dot plot, is presented with intervals, it is typically unnecessary to include end caps on the intervals.
- An exception: intervals where additional values outside the intervals may appear as dots, as with conventional box-and-whisker plots.
- Where line graphs have multiple pointwise intervals, use shaded polygons, so that the main line appears within a band bounded by the lower and upper limits. (A sketch appears at the end of this section.)
- Adapt the vertical and horizontal extents of a graphic to the scale of the data. Limiting axis extent helps to make better use of available space and reader focus.
- Allow axis extents to follow the empirical range of graphed elements more closely and on useful scales. For example, if graphed values extend vertically between 49 and 71, then it would be appropriate to construct a vertical axis from 50 to 70. The same concept applies to horizontal extents.
- It is not necessary to include an origin or reference value (e.g., 0 or 1) or a maximum value (e.g., 100%) if the graphic can otherwise be constructed honestly without those reference values. Use a broken axis rarely.
- Use nonlinear scales where it makes sense to do so, but label axes or values on the original scale. For example, odds ratios, prevalence ratios, hazard ratios, and some chemical or pathogen concentrations can often be presented on a logarithmic scale.
- A graphic should almost never contain components on multiple scales.
- If a graphic is presented with multiple scales, then the relationship between those scales should be carefully planned to avoid distorting data or relationships between data elements.
- If a graphic is presented with multiple scales, then ensure that superposed elements are not obscured.
- When graphics contain multiple panels, ensure that a reader can relate the scales between components. For example, apply the same horizontal and vertical extents to each component, unless doing so obscures visual elements.
- Ensure that graphics render well in grayscale. Avoid hash patterns.
- Promote use of superposition between scatterplots and smooths (e.g., segmented curves) or between different smooths (e.g., loess and segmented curves).
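As a sketch combining several of the figure recommendations—a shaded pointwise band instead of capped error bars, a panel with 2 axes rather than a 4-sided frame, no grid, and grayscale-safe shading—the following matplotlib fragment plots simulated weekly rates. All data are simulated.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
weeks = np.arange(1, 27)
rate = 50 + 10 * np.sin(weeks / 4) + rng.normal(0, 1, weeks.size)
half_width = 4 + rng.uniform(0, 1, weeks.size)  # stand-in for pointwise 95% CIs

fig, ax = plt.subplots(figsize=(7, 3.5))
# Shaded polygon from the lower limits to the upper limits, with the
# estimate drawn on top -- in place of 26 error bars with end caps.
ax.fill_between(weeks, rate - half_width, rate + half_width,
                color="0.85", linewidth=0, label="95% CI")
ax.plot(weeks, rate, color="black", label="Estimate")
# Spartan styling: 2 axes, no frame, no grid; data stay in the foreground.
for side in ("top", "right"):
    ax.spines[side].set_visible(False)
ax.set_xlabel("Week")
ax.set_ylabel("Rate per 100,000")
ax.legend(frameon=False)
fig.tight_layout()
fig.savefig("band.png", dpi=200)
```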