Chapter 6 Exploring Crime in Florida

My company has chosen me to investigate and uncover what socioeconomic factors are most strongly associated with rising crime rates across Florida counties. I’m super excited!

6.1 Loading and Preparing the Data

Now I’m going to load and clean the data.

library(readxl)
florida_crime <- read_excel("data/Florida County Crime Rates.xlsx")
head(florida_crime)

## # A tibble: 6 × 5
##   County       C     I    HS     U
##   <chr>    <dbl> <dbl> <dbl> <dbl>
## 1 ALACHUA    104  22.1  82.7  73.2
## 2 BAKER       20  25.8  64.1  21.5
## 3 BAY         64  24.7  74.7  85  
## 4 BRADFORD    50  24.6  65    23.2
## 5 BREVARD     64  30.5  82.3  91.9
## 6 BROWARD     94  30.6  76.8  98.9

6.1.1 Renaming columns and formatting county names

library(tidyverse)
florida_crime <- florida_crime %>%
  rename(
    Crime = "C",
    Income = "I",
    HighSchoolGrad = "HS",
    UrbanPop = "U"
  )
florida_crime

## # A tibble: 67 × 5
##    County    Crime Income HighSchoolGrad UrbanPop
##    <chr>     <dbl>  <dbl>          <dbl>    <dbl>
##  1 ALACHUA     104   22.1           82.7     73.2
##  2 BAKER        20   25.8           64.1     21.5
##  3 BAY          64   24.7           74.7     85  
##  4 BRADFORD     50   24.6           65       23.2
##  5 BREVARD      64   30.5           82.3     91.9
##  6 BROWARD      94   30.6           76.8     98.9
##  7 CALHOUN       8   18.6           55.9      0  
##  8 CHARLOTTE    35   25.7           75.7     80.2
##  9 CITRUS       27   21.3           68.6     31  
## 10 CLAY         41   34.9           81.2     65.8
## # ℹ 57 more rows

library(tidyverse)
library(skimr)

## 
## Attaching package: 'skimr'

## The following object is masked from 'package:mosaic':
## 
##     n_missing

florida_crime <- florida_crime %>%
  mutate (County = str_to_title(County))
florida_crime

## # A tibble: 67 × 5
##    County    Crime Income HighSchoolGrad UrbanPop
##    <chr>     <dbl>  <dbl>          <dbl>    <dbl>
##  1 Alachua     104   22.1           82.7     73.2
##  2 Baker        20   25.8           64.1     21.5
##  3 Bay          64   24.7           74.7     85  
##  4 Bradford     50   24.6           65       23.2
##  5 Brevard      64   30.5           82.3     91.9
##  6 Broward      94   30.6           76.8     98.9
##  7 Calhoun       8   18.6           55.9      0  
##  8 Charlotte    35   25.7           75.7     80.2
##  9 Citrus       27   21.3           68.6     31  
## 10 Clay         41   34.9           81.2     65.8
## # ℹ 57 more rows

skim(florida_crime)

Table 6.1: Data summary
Name	florida_crime
Number of rows	67
Number of columns	5
_______________________
Column type frequency:
character	1
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
County	0	1	3	9	0	67	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Crime	1	52.40	28.19	0.0	35.50	52.0	69.00	128.0	▃▇▇▃▂
Income	1	24.51	4.68	15.4	21.05	24.6	28.15	35.6	▂▇▅▅▂
HighSchoolGrad	1	69.49	8.86	54.5	62.45	69.0	76.90	84.9	▇▇▆▇▆
UrbanPop	1	49.56	33.97	0.0	21.60	44.6	83.55	99.6	▅▆▂▃▇

summary(florida_crime)

##     County              Crime           Income      HighSchoolGrad 
##  Length:67          Min.   :  0.0   Min.   :15.40   Min.   :54.50  
##  Class :character   1st Qu.: 35.5   1st Qu.:21.05   1st Qu.:62.45  
##  Mode  :character   Median : 52.0   Median :24.60   Median :69.00  
##                     Mean   : 52.4   Mean   :24.51   Mean   :69.49  
##                     3rd Qu.: 69.0   3rd Qu.:28.15   3rd Qu.:76.90  
##                     Max.   :128.0   Max.   :35.60   Max.   :84.90  
##     UrbanPop    
##  Min.   : 0.00  
##  1st Qu.:21.60  
##  Median :44.60  
##  Mean   :49.56  
##  3rd Qu.:83.55  
##  Max.   :99.60

6.2 Exploratory Data Analysis

Descriptive statistics for florida_crime.

summary(florida_crime)

##     County              Crime           Income      HighSchoolGrad 
##  Length:67          Min.   :  0.0   Min.   :15.40   Min.   :54.50  
##  Class :character   1st Qu.: 35.5   1st Qu.:21.05   1st Qu.:62.45  
##  Mode  :character   Median : 52.0   Median :24.60   Median :69.00  
##                     Mean   : 52.4   Mean   :24.51   Mean   :69.49  
##                     3rd Qu.: 69.0   3rd Qu.:28.15   3rd Qu.:76.90  
##                     Max.   :128.0   Max.   :35.60   Max.   :84.90  
##     UrbanPop    
##  Min.   : 0.00  
##  1st Qu.:21.60  
##  Median :44.60  
##  Mean   :49.56  
##  3rd Qu.:83.55  
##  Max.   :99.60

By using the summary function (twice) we can take a look at the mean, median, and range of the dataset.

Mean Crime = 52.4, Mean Income = 24.15, Mean HighSchoolGrad = 69.49, Mean UrbanPop = 49.56 Median Crime = 52, Median Income = 24.60, Median HighSchoolGrad = 69, Median UrbanPop = 44.60 Range Crime = 0-128, Range Income = 15.40-35.60, Range HighSchoolGrad = 54.50-84.90, Range UrbanPop = 0-99.60

6.3 Visualizing the data in all its glory

viz1 <- ggplot(florida_crime, aes(x=Income, y=Crime))+
  geom_point(size=2.5)+
  geom_smooth(method = "lm", se=FALSE)+
  labs(
    title = "Income vs Crime",
    x = "Income",
    y = "Crime"
  )
print(viz1)

## `geom_smooth()` using formula = 'y ~ x'

Figure 6.1: This graph shows how crime varies by income.

Looking at the slope line, the graph shows that as Income increases Crime also increases.

viz2 <- ggplot(florida_crime, aes(x=HighSchoolGrad, y=Crime))+
  geom_point(size=2.5)+
  geom_smooth(method = "lm", se=FALSE)+
  labs(
    title = "School vs Crime",
    x = "School",
    y = "Crime"
  )
print(viz2)

## `geom_smooth()` using formula = 'y ~ x'

Figure 6.2: This graph shows hows crime varies by education.

Here we also see a similar trend from the previous graph. As school graduation increases so does Crime.

viz3 <- ggplot(florida_crime, aes(x=UrbanPop, y=Crime))+
  geom_point(size=2.5)+
  geom_smooth(method = "lm", se=FALSE)+
  labs(
    title = "Urban Lifestyle vs Crime",
    x = "UrbanPop",
    y = "Crime"
  )
print(viz3)

## `geom_smooth()` using formula = 'y ~ x'

Figure 6.3: This graph shows how crime varies by population.

Again, there is a similar pattern as the two previous graphs. As the percentage of the urban population increases so does Crime.

Let’s use patchwork for better readability.

library(patchwork)

## 
## Attaching package: 'patchwork'

## The following object is masked from 'package:MASS':
## 
##     area

viz1+viz2+viz3

## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

Figure 6.4: Side-by-side comparison of the three scatterplots.

6.4 Correlation Analysis

library(tidyverse)
Numeric_florida_crime <- florida_crime %>% 
  dplyr::select(Crime, Income, HighSchoolGrad, UrbanPop)
head(Numeric_florida_crime)

## # A tibble: 6 × 4
##   Crime Income HighSchoolGrad UrbanPop
##   <dbl>  <dbl>          <dbl>    <dbl>
## 1   104   22.1           82.7     73.2
## 2    20   25.8           64.1     21.5
## 3    64   24.7           74.7     85  
## 4    50   24.6           65       23.2
## 5    64   30.5           82.3     91.9
## 6    94   30.6           76.8     98.9

library(Hmisc)

## Warning: package 'Hmisc' was built under R version 4.5.2

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

Cor_Max <- rcorr(as.matrix(Numeric_florida_crime))
Cor_Max

##                Crime Income HighSchoolGrad UrbanPop
## Crime           1.00   0.43           0.47     0.68
## Income          0.43   1.00           0.79     0.73
## HighSchoolGrad  0.47   0.79           1.00     0.79
## UrbanPop        0.68   0.73           0.79     1.00
## 
## n= 67 
## 
## 
## P
##                Crime Income HighSchoolGrad UrbanPop
## Crime                2e-04  0e+00          0e+00   
## Income         2e-04        0e+00          0e+00   
## HighSchoolGrad 0e+00 0e+00                 0e+00   
## UrbanPop       0e+00 0e+00  0e+00

There is a positive correlation between Income and Crime r = 0.43 There is a positive correlation between HighSchoolGrad and Crime r= 0.47. There is a positive correlation between UrbanPop and Crime r = 0.68 (the strongest correlation among the variables).

6.5 Building Regression Models

m1 <- lm(Crime ~ Income, data = Numeric_florida_crime)
summary(m1)

## 
## Call:
## lm(formula = Crime ~ Income, data = Numeric_florida_crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.452 -21.347  -3.102  17.580  69.357 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11.6059    16.7863  -0.691 0.491782    
## Income        2.6115     0.6729   3.881 0.000246 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.6 on 65 degrees of freedom
## Multiple R-squared:  0.1881, Adjusted R-squared:  0.1756 
## F-statistic: 15.06 on 1 and 65 DF,  p-value: 0.0002456

AIC(m1)

## [1] 628.6045

m2 <- lm(Crime ~ HighSchoolGrad, data = Numeric_florida_crime)
summary(m2)

## 
## Call:
## lm(formula = Crime ~ HighSchoolGrad, data = Numeric_florida_crime)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -43.74 -21.36  -4.82  17.42  82.27 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -50.8569    24.4507  -2.080   0.0415 *  
## HighSchoolGrad   1.4860     0.3491   4.257 6.81e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.12 on 65 degrees of freedom
## Multiple R-squared:  0.218,  Adjusted R-squared:  0.206 
## F-statistic: 18.12 on 1 and 65 DF,  p-value: 6.806e-05

AIC(m2)

## [1] 626.0932

m3 <- lm(Crime ~ UrbanPop, data = Numeric_florida_crime)
summary(m3)

## 
## Call:
## lm(formula = Crime ~ UrbanPop, data = Numeric_florida_crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.766 -16.541  -4.741  16.521  49.632 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 24.54125    4.53930   5.406 9.85e-07 ***
## UrbanPop     0.56220    0.07573   7.424 3.08e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.9 on 65 degrees of freedom
## Multiple R-squared:  0.4588, Adjusted R-squared:  0.4505 
## F-statistic: 55.11 on 1 and 65 DF,  p-value: 3.084e-10

AIC(m3)

## [1] 601.43

6.6 Multiple regression models

m4 <- lm(Crime ~ Income + UrbanPop, data = Numeric_florida_crime)
summary(m4)

## 
## Call:
## lm(formula = Crime ~ Income + UrbanPop, data = Numeric_florida_crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.130 -15.590  -6.484  16.595  48.921 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.9723    16.3536   2.444   0.0173 *  
## Income       -0.7906     0.8049  -0.982   0.3297    
## UrbanPop      0.6418     0.1110   5.784 2.36e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.91 on 64 degrees of freedom
## Multiple R-squared:  0.4669, Adjusted R-squared:  0.4502 
## F-statistic: 28.02 on 2 and 64 DF,  p-value: 1.815e-09

AIC(m4)

## [1] 602.4276

m5 <- lm(Crime ~ Income + HighSchoolGrad, data = Numeric_florida_crime)
summary(m5)

## 
## Call:
## lm(formula = Crime ~ Income + HighSchoolGrad, data = Numeric_florida_crime)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -42.75 -19.61  -4.57  18.52  77.86 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    -46.1094    24.9723  -1.846   0.0695 .
## Income           1.0311     1.0839   0.951   0.3450  
## HighSchoolGrad   1.0540     0.5729   1.840   0.0705 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.14 on 64 degrees of freedom
## Multiple R-squared:  0.2289, Adjusted R-squared:  0.2048 
## F-statistic:   9.5 on 2 and 64 DF,  p-value: 0.000244

AIC(m5)

## [1] 627.1524

m6 <- lm(Crime ~ Income + UrbanPop + HighSchoolGrad, data = Numeric_florida_crime)
summary(m6)

## 
## Call:
## lm(formula = Crime ~ Income + UrbanPop + HighSchoolGrad, data = Numeric_florida_crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.407 -15.080  -6.588  16.178  50.125 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     59.7147    28.5895   2.089   0.0408 *  
## Income          -0.3831     0.9405  -0.407   0.6852    
## UrbanPop         0.6972     0.1291   5.399 1.08e-06 ***
## HighSchoolGrad  -0.4673     0.5544  -0.843   0.4025    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.95 on 63 degrees of freedom
## Multiple R-squared:  0.4728, Adjusted R-squared:  0.4477 
## F-statistic: 18.83 on 3 and 63 DF,  p-value: 7.823e-09

AIC(m6)

## [1] 603.6764

6.7 Comparing models

m1, R-squared of 0.188, Adjusted R-squared of 0.176, and AIC [628.6045] (variance explained is 18%).

m2, R-squared of 0.218, Adjusted R-squared of 0.206, and AIC [626.0932] (variance explained is 21%).

m3, R-squared of 0.459, Adjusted R-squared of 0.451, and AIC [601.43] (variance explained is 45%).

m4, R-squared of 0.467, Adjusted R-squared of 0.451, and AIC [602.4276] (variance explained is 45%).

m5, R-squared of 0.229, Adjusted R-squared of 0.205, and AIC [627.1524] (variance explained is 20%).

m6, R-squared of 0.473, Adjusted R-squared of 0.447, and AIC [603.6764] (variance explained is 45%).

6.8 Reflections

I can see that m3, m4, and m6 have the highest R-squared and Adjusted R-squared compared to m1,m2, and m5. Also, m3 has the lowest AIC which makes this model less complex when explaining the variance between the variables. I can also see that m4 and m6 have the lowest AIC following m3. m3 is more balance when it comes to R-squared and Adjusted R-squared. It’s safe to say that m3 is the best model for accuracy and simplicity.

6.9 Conclusion

After cleaning and analyzing the data by comparing different models for better crime predictions, I am confident about recommending model 3 (Crime and Urban Population) as it the most influential predictor when it comes to crime. This model explains 45% of the variance between the dependent variable (Crime) and the independent variable (UrbanPop). The PD should focus on highly populated areas to provide a sense of security to the population, which can help reduce the possibilities of committing crimes. One limitation is that correlation is not causation, and besides we don’t really have much data to look into age as a potential variable to understand what age groups are more likely to engage in such activities and devise better plans for interventions.