1 Prepare data

Objectives

The goal of this chapter is to prepare the dataset(s) I need for my RWB Shiny dashboard. Preparing the data takes several steps:

Get RWB datasets for all years from the RWB website (Section 1.1).
Get the UNSD classification system using the M49 methodology (Section 1.3).

1.1 Get RWB data

The first task is to download all datasets from the RWB website.

There is no place where I can get an integrated, harmonized dataset. The only way I found is to go to every year’s index and use the button “Download this index”. Fortunately the link stem and the structure of the file name are the same for all years.

Example: https://rsf.org/sites/default/files/import_classement/2025.csv.

So I just have to change the year of the CSV file name.
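As a minimal sketch of this idea, the download URLs can be built in one step from the link stem and the year. The names base_url, years and rsf_urls are only illustrative; the year 2011 is left out because no file exists for it (see the comment in the listing below).

```r
# Link stem taken from the example URL above
base_url <- "https://rsf.org/sites/default/files/import_classement/"

# All index years from 2002 to 2025 except 2011, which has no file of its own
years <- setdiff(2002:2025, 2011)

# One URL per year; paste0() is vectorized over the years
rsf_urls <- paste0(base_url, years, ".csv")

head(rsf_urls, 2)
length(rsf_urls)
```

Nothing is downloaded here; the vector only holds the addresses that a reader function could loop over.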

R Code 1.1 : Get RWB datasets

Run this code chunk manually if the file(s) still needs to be downloaded.

Listing / Output 1.1: Download and store datasets from RWB website

Code

base::source(file = "R/helper.R")
url <- "https://rsf.org/sites/default/files/import_classement/"

rsf_year <- function(year) {
    rsf_name <- paste0(url, year, ".csv")
    readr::read_delim(rsf_name,
                      delim = ";", 
                      escape_double = FALSE, 
                      trim_ws = TRUE,
                      show_col_types = FALSE)
}


my_year <- list(
    "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010",
    # 2011 is missing
    "2012", "2013", "2014", "2015", "2016", "2017", "2018", "2019", "2020",
    "2021", "2022", "2023", "2024", "2025")

## download each year, keep it in memory as rsf<year> and save it as rsf<year>.rds
for (i in seq_along(my_year)) {
    my_name <- paste0("rsf", my_year[[i]])
    my_save_data_file("chap011/rsf",
                      assign(my_name,  rsf_year(my_year[[i]])),
                      paste0(my_name, ".rds")
    )
}

(No output is available for this R code chunk.)

The side effect of Listing / Output 1.1 is a collection of data files from 2002 to 2025 (2011 missing), each named rsf<year>.rds.

1.2 Clean RWB data

A manual inspection of the RSF website and of the downloaded datasets reveals three different structures of RWB datasets. My aim is to create one dataset for the whole time range 2002-2025. Therefore I have to compare the structure of the different datasets.

The first step is to load all datasets into the R memory:

R Code 1.2 : Load all datasets into R memory

Run this code chunk manually if the file(s) still needs to be loaded into memory

Listing / Output 1.2: Load all saved RWB datasets (2002-2025, 2011 missing) into R memory

Code

base::source(file = "R/helper.R")
my_get_dir_files("data/chap011/rsf", "\\.rds$")

(No output is available for this R code chunk. The function has the side effect that all files with the specified path and file extension are loaded into the R memory.)
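The helper my_get_dir_files() lives in R/helper.R and is not shown in this chapter. The following is only a hypothetical stand-in, assuming that the helper reads every matching .rds file and assigns it to an object named after the file; the function name load_rds_dir and its arguments are my own choices for illustration.

```r
# Hypothetical stand-in for my_get_dir_files(); NOT the code from R/helper.R
load_rds_dir <- function(dir, pattern = "\\.rds$", envir = globalenv()) {
  files <- list.files(dir, pattern = pattern, full.names = TRUE)
  obj_names <- tools::file_path_sans_ext(basename(files))
  for (i in seq_along(files)) {
    # e.g. "rsf2022.rds" becomes the object rsf2022
    assign(obj_names[i], readRDS(files[i]), envir = envir)
  }
  invisible(obj_names)
}

# Usage with a temporary directory and a separate environment
tmp <- file.path(tempdir(), "rsf_demo")
dir.create(tmp, showWarnings = FALSE)
saveRDS(data.frame(ISO = "NOR", Rank = 1), file.path(tmp, "rsf2022.rds"))

demo_env <- new.env()
loaded <- load_rds_dir(tmp, envir = demo_env)
exists("rsf2022", envir = demo_env)
```

Loading into a dedicated environment, as in the usage part, avoids cluttering the global environment while testing.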

1.2.1 Batch 1: 2022-2025

1.2.1.1 Compare structure

Data frames of the years 2022-2025 contain the most complete datasets. From 2022 onwards a new Methodology, worked out by a panel of experts, has been used for compiling the World Press Freedom Index.

They have slightly different column numbers and column names:

R Code 1.3 : Compare structure of the datasets 2022-2025

Listing / Output 1.3: Compare the structure of the datasets 2022-2025 by using the janitor::compare_df_cols() function

Code

janitor::compare_df_cols(rsf2025, rsf2024, rsf2023, rsf2022)

#>          column_name   rsf2025   rsf2024   rsf2023   rsf2022
#> 1         Country_AR character character character character
#> 2         Country_EN character character character character
#> 3         Country_ES character character character character
#> 4         Country_FA character character character character
#> 5         Country_FR character character character character
#> 6         Country_PT character character character      <NA>
#> 7   Economic Context   numeric   numeric   numeric   numeric
#> 8                ISO character character character character
#> 9      Legal Context   numeric   numeric   numeric   numeric
#> 10 Political Context   numeric   numeric   numeric   numeric
#> 11              Rank   numeric   numeric   numeric   numeric
#> 12    Rank evolution   numeric   numeric   numeric   numeric
#> 13          Rank N-1   numeric   numeric   numeric   numeric
#> 14          Rank_Eco   numeric   numeric   numeric   numeric
#> 15          Rank_Leg   numeric   numeric   numeric   numeric
#> 16          Rank_Pol   numeric   numeric   numeric   numeric
#> 17          Rank_Saf   numeric   numeric   numeric   numeric
#> 18          Rank_Soc   numeric   numeric   numeric   numeric
#> 19            Safety   numeric   numeric   numeric   numeric
#> 20             Score      <NA>   numeric   numeric   numeric
#> 21        Score 2025   numeric      <NA>      <NA>      <NA>
#> 22   Score evolution character character character   numeric
#> 23         Score N-1   numeric   numeric   numeric   numeric
#> 24         Situation      <NA> character      <NA>      <NA>
#> 25    Social Context   numeric   numeric   numeric   numeric
#> 26          Year (N)   numeric   numeric   numeric   numeric
#> 27              Zone character character character character

The table in Listing / Output 1.3 shows the differences with <NA> values:

Line 6: The country names in Portuguese are missing in 2022. This is not relevant for my use case, as I will only use Country_EN.
Lines 20 and 21: The column Score 2025 in the 2025 dataset has to be renamed to Score so that it matches the other datasets.
Lines 22 and 23: Score evolution and the score of the previous year (Score N-1) are missing in the original 2022 data. The reason is that in 2021 the score calculation followed a different methodology. (The table nevertheless lists both columns for 2022, presumably because it was rendered after they had been added in Listing / Output 1.4.)
Line 24: Only the 2024 dataset has a column that judges the absolute score values with five predicates (in French): Bonne situation, Situation plutôt bonne, Situation problématique, Situation difficile, Situation très grave. The class limits changed in 2022 and are documented (in English) in the already mentioned article on Methodology 2022-2025.

Important 1.1: Methodological considerations

Although the years 2022-2025 and 2013-2021 use the same scale (0-100 points), these scales measure different factors of press freedom:

2022-2025: global score, political context, economic context, legal context, social context and safety.
2013-2021: global score, pluralism, media independence, environment and self-censorship, legislative framework, transparency, infrastructure, abuses. (Only the global score is reported.)

Comparing ranks across these two measurement methods is unproblematic. Comparing the score values, however, is delicate. To see why, imagine a comparison of cars: in one period we measure their maximum speed, in the other their fuel economy. Even if we rate both measures on a scale of 0-100, they are not comparable.

But in our case we have not measured different outcomes or products; we have used different indicators to measure the same thing, namely press freedom. I therefore believe that it is feasible to compare the global scores of the two methods. To stay with the car example: in one period we measure fuel economy on the highway at a speed of 100 km/h, in the other we use a combined measure of driving in the city, on country roads and on highways. Both methods give us a measure of fuel economy, even if the components of the global score differ.

As the RWB methods in both periods used the same scale (0 for the worst, 100 for the best), I think it is legitimate to compute the difference between 2022 (the first year with the new method) and 2021 (the last year with the previous method). This reflection concerns the two missing columns in 2022: Score evolution and Score N-1. To bind the rows together we need to add these two columns with the appropriate values to 2022.

In graphs comparing global score values for years with different methods it would be helpful to signal this distinction with different colors, line types etc. in the same figure.
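One way to prepare such a distinction is a helper column that records the method used for each year. This is only a sketch with made-up values; the column names year_n and score anticipate the cleaned datasets, and the column name method and its labels are assumptions of mine.

```r
# Toy data with made-up scores, for illustration only
scores <- data.frame(
  year_n = c(2020, 2021, 2022, 2023),
  score  = c(80.1, 79.5, 78.2, 77.9)
)

# Flag the measurement method: the new method applies from 2022 onwards
scores$method <- factor(
  ifelse(scores$year_n >= 2022, "2022 onwards", "2013-2021"),
  levels = c("2013-2021", "2022 onwards")
)

scores
```

In a plot, the method column could then be mapped to color or line type so that the break between the two methods stays visible.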

1.2.1.2 Adapt 2022 dataset

R Code 1.4 : Add Score N-1 and Score evolution to RWB dataset 2022

Run this code chunk manually if the 2022 dataset still needs to be adapted.

Listing / Output 1.4: Add Score N-1 from 2021 and compute Score evolution to RWB dataset 2022

Code

rsf2021_short <- readRDS(paste0(here::here(), "/data/chap011/rsf/rsf2021.rds")) |> 
    dplyr::select(c(ISO, `Score N`)) |> 
    dplyr::rename(`Score N-1` = `Score N`)

rsf2022 <- readRDS(paste0(here::here(), "/data/chap011/rsf/rsf2022.rds"))

## check if columns for rsf2022 are already present
if (!"Score evolution" %in% names(rsf2022)) {
  rsf2022 <- rsf2022 |> 
    dplyr::left_join(rsf2021_short, "ISO") |> 
    dplyr::mutate(`Score evolution` = Score - `Score N-1`)

  my_save_data_file("chap011/rsf", rsf2022, "rsf2022.rds")
}

rsf2022

#> # A tibble: 180 × 24
#>    ISO   Score  Rank `Political Context` Rank_Pol `Economic Context` Rank_Eco
#>    <chr> <dbl> <dbl>               <dbl>    <dbl>              <dbl>    <dbl>
#>  1 NOR    9265     1                9489        1               9038        1
#>  2 DNK    9027     2                9434        2               8367        3
#>  3 SWE    8884     3                9196        3               8766        2
#>  4 EST    8883     4                9111        5               8197        6
#>  5 FIN    8842     5                 904        6               8203        5
#>  6 IRL     883     6                8924        9               7908        8
#>  7 PRT    8707     7                9186        4               7741        9
#>  8 CRI    8592     8                8162       17               7296       11
#>  9 LTU    8414     9                8701       11               7216       13
#> 10 LIE    8403    10                8036       20               6878       19
#> # ℹ 170 more rows
#> # ℹ 17 more variables: `Legal Context` <dbl>, Rank_Leg <dbl>,
#> #   `Social Context` <dbl>, Rank_Soc <dbl>, Safety <dbl>, Rank_Saf <dbl>,
#> #   Zone <chr>, Country_EN <chr>, Country_FR <chr>, Country_ES <chr>,
#> #   Country_AR <chr>, Country_FA <chr>, `Year (N)` <dbl>, `Rank N-1` <dbl>,
#> #   `Rank evolution` <dbl>, `Score N-1` <dbl>, `Score evolution` <dbl>

Inconsistent score values

A check of the computed values for Score evolution shows inconsistent values. Compare, for instance, the value of the Score column for Ireland (ISO = “IRL”) in the 2022 dataset with the rest of the first ten rows. This problem affects all score columns, as one can see in the Political Context value for Finland (FIN).

A further detailed examination revealed the following:

The values are not within the range 0-100 because the decimal point is missing. For instance, the Score value of Norway (NOR) for 2022 is 9265 instead of 92.65.
Trailing zeros are dropped. The value 883 for Ireland stands for 8830, that is 88.30 on the scale of 0-100. There are also some values with only two digits, which have lost two trailing zeros.
A thorough review of other years showed that some years break ties with an additional third decimal digit, again without a decimal mark. Instead of 65.487 for the USA and 65.486 for Gambia in 2025 we have 65487 and 65486.
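One possible explanation, which is not verified here against the raw CSV files, is that the files store scores with a decimal comma while readr's default locale treats the comma as a grouping mark and drops it. The sketch below only demonstrates this parsing behavior on hypothetical raw strings that correspond to the values discussed above.

```r
# Hypothetical raw strings, ASSUMING the CSV stores scores with a decimal comma
raw <- c("92,65", "88,3", "65,487")

# Default locale: the comma is a grouping mark and is removed
default_parse <- readr::parse_number(raw)

# Locale with a decimal comma: the decimals survive
comma_parse <- readr::parse_double(
  raw,
  locale = readr::locale(decimal_mark = ",")
)

default_parse
comma_parse
```

If the assumption holds, all three symptoms (missing decimal point, lost trailing zeros, an extra digit for ties) would have the same cause. It would have to be checked whether passing such a locale to readr::read_delim() works for every column of every year.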

1.2.1.3 Compare column types

Before we can bind the datasets 2022-2025 together we have to solve another issue. A glimpse at the data shows that for the years 2023-2025 Score evolution is of type character with decimal commas (e.g., "0,63") instead of type double with a decimal point (e.g., 0.63) as used in R. To bind the rows of different years together we have to match not only the column names but also the column types.
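A minimal base R sketch of this conversion, using three Score evolution values from the 2025 dataset: replace the decimal comma with a point and then convert to double. The object names are only illustrative.

```r
# Character values with decimal commas, as found in Score evolution
score_evolution_chr <- c("0,63", "3,02", "-0,19")

# Swap the comma for a point, then convert to double
score_evolution_num <- as.numeric(sub(",", ".", score_evolution_chr, fixed = TRUE))

score_evolution_num
```

sub() replaces only the first comma in each string, which is sufficient here because each value contains a single decimal mark.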

R Code 1.5 : Inspect the columns type for the RWB dataset 2025

Listing / Output 1.5: Column types of original RWB dataset 2025

Code

rsf2025 <- readRDS(paste0(here::here(), "/data/chap011/rsf/rsf2025.rds"))
dplyr::glimpse(rsf2025)

#> Rows: 180
#> Columns: 25
#> $ ISO                 <chr> "FIN", "EST", "NLD", "SWE", "LTU", "DNK", "IRL", "…
#> $ `Score 2025`        <dbl> 8718, 8946, 8864, 8813, 8227, 8693, 8692, 8426, 83…
#> $ Rank                <dbl> 5, 2, 3, 4, 14, 6, 7, 8, 9, 10, 1, 12, 13, 11, 15,…
#> $ `Political Context` <dbl> 8993, 9087, 8995, 9007, 8076, 9113, 913, 8877, 857…
#> $ Rank_Pol            <dbl> 7, 4, 6, 5, 17, 3, 2, 8, 9, 11, 1, 18, 12, 10, 13,…
#> $ `Economic Context`  <dbl> 8054, 794, 8385, 8271, 6884, 7846, 7877, 6583, 732…
#> $ Rank_Eco            <dbl> 4, 5, 2, 3, 15, 7, 6, 20, 9, 22, 1, 18, 17, 8, 11,…
#> $ `Legal Context`     <dbl> 8793, 90, 8969, 9002, 8323, 8678, 8149, 8616, 8373…
#> $ Rank_Leg            <dbl> 7, 3, 4, 2, 18, 8, 24, 9, 15, 6, 1, 5, 28, 11, 23,…
#> $ `Social Context`    <dbl> 8387, 9161, 8805, 8499, 8539, 8385, 8725, 8674, 83…
#> $ Rank_Soc            <dbl> 12, 1, 3, 9, 8, 13, 5, 6, 15, 4, 2, 10, 7, 23, 17,…
#> $ Safety              <dbl> 9365, 9541, 9164, 9286, 9312, 9443, 958, 9381, 938…
#> $ Rank_Saf            <dbl> 14, 5, 27, 18, 16, 8, 4, 13, 12, 11, 2, 3, 1, 35, …
#> $ Zone                <chr> "UE Balkans", "UE Balkans", "UE Balkans", "UE Balk…
#> $ Country_FR          <chr> "Finlande", "Estonie", "Pays-Bas", "Su\xe8de", "Li…
#> $ Country_EN          <chr> "Finland", "Estonia", "Netherlands", "Sweden", "Li…
#> $ Country_ES          <chr> "Finlandia", "Estonia", "Pa\xedses Bajos", "Suecia…
#> $ Country_PT          <chr> "Finl\xe2ndia", "Est\xf4nia", "Pa\xedses Baixos", …
#> $ Country_AR          <chr> "??????", "???????", "??????", "??????", "????????…
#> $ Country_FA          <chr> "??????", "??????", "????", "????", "???????", "??…
#> $ `Year (N)`          <dbl> 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2025, 20…
#> $ `Rank N-1`          <dbl> 5, 6, 4, 3, 13, 2, 8, 7, 9, 17, 1, 15, 11, 10, 12,…
#> $ `Rank evolution`    <dbl> 0, 4, 1, -1, -1, -4, 1, -1, 0, 7, 0, 3, -2, -1, -3…
#> $ `Score N-1`         <dbl> 8655, 8644, 8773, 8832, 8173, 896, 8559, 859, 8401…
#> $ `Score evolution`   <chr> "0,63", "3,02", "0,91", "-0,19", "0,54", "-2,67", …

Listing / Output 1.5 shows the problem with Score evolution. The inspection also shows that we need to convert all character columns to columns of type factor. But we can do this later, when we bind the rows together.

Another issue to clean up is that some country names, even in the English version, are not valid UTF-8. This problem only concerns the 2025 dataset:

“C�te d’Ivoire” instead of “Côte d’Ivoire” in Country_EN
“T�rkiye” instead of “Türkiye” in Country_EN and
“Am�riques” instead of “Amériques” in Zone.

Incidentally, this shows that the regional names in Zone are in French, not in English, for all years.
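The escapes such as \xe8 in Listing / Output 1.5 suggest that the affected strings contain Latin-1 bytes. Under that assumption, which I have not checked against the raw file, the strings could be repaired by re-encoding instead of pattern matching. A minimal sketch:

```r
# ASSUMPTION: the broken strings are Latin-1 bytes that were not declared as such
raw_names <- c("C\xf4te d'Ivoire", "T\xfcrkiye", "Am\xe9riques")

# Convert from Latin-1 to UTF-8
fixed_names <- iconv(raw_names, from = "latin1", to = "UTF-8")

fixed_names
```

This would not restore the Arabic and Persian names, which already appear as question marks in the listing.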

1.2.1.4 Clean column structure

R Code 1.6 : Clean column structure 2022-2025

Run this code chunk manually if the recoded file(s) still needs to be created

Listing / Output 1.6: Clean and reorganize column structure for row binding

Code

base::source(file = "R/helper.R")
save_path = "chap011/rsf_rec"
save_ext = "_rec.rds"
load_path = paste0(here::here(), "/data/chap011/rsf/")

rsf_batch1 <- function(df, year, path, ext) {
    if ("Score 2025" %in% names(df)) {
        df <- df |> 
            dplyr::rename(Score = `Score 2025`) |> 
            dplyr::mutate(
                Country_EN = dplyr::case_when(
                stringr::str_detect(Country_EN, "d'Ivoire") ~ "Côte d'Ivoire",
                stringr::str_detect(Country_EN, "rkiye") ~ "Türkiye",
                .default = Country_EN
            ),
                Zone = dplyr::if_else(stringr::str_detect(Zone, "riques"),
                        "Amériques", Zone)
            ) 
    }
    if ("Situation" %in% names(df)) {
        df <- dplyr::select(df, -Situation)
    }
    if (year %in% 2023:2025) {
        df <- df |> 
            dplyr::mutate(`Score evolution` =
                as.double(stringr::str_replace(`Score evolution`, ",", "."))
        )
    }

    df <- df |>
        janitor::clean_names() |>
        dplyr::relocate(country_en, .after = iso) |>
        dplyr::select(-c(country_fr:country_fa)) |>
        dplyr::relocate(year_n, .before = iso)
    my_save_data_file(path, df, 
                      paste0("rsf", year, ext))
}

get_rsf_recoded1 <- function(years, path) {
  for (i in 1:length(years)) {
    my_name <- paste0(path, "rsf", years[i], ".rds")
    file_name <- basename(my_name)
    rsf_batch1(assign(
      file_name, readRDS(my_name)),
      years[i],
      save_path,
      save_ext)
  }
}

get_rsf_recoded1(2022:2025, load_path)

(No output is available for this R code chunk.)

1.2.1.5 Clean values

We are now in a position to clean the values and bind the datasets 2022-2025 together by rows. I will take this opportunity to write a function, because most of the code is also needed for the other two batches.

Procedure 1.1 : Clean values for the datasets 2022-2025

Let’s summarize what we want to clean up:

Bind the rows of the datasets together
Update all score figures: This includes the (global) score but also the political_context, the economic_context, the legal_context, the social_context, the safety and the score_n_1 columns. The update has to be done in a sequence of 5 steps. The correct sequence is important!
- Multiply all scores smaller than 100 by 100
- Multiply all scores smaller than 1000 by 10
- Divide all scores bigger than 10000 by 10
- Divide the new numbers by 100 to get the correct decimal scores
- Update score_evolution by subtracting score_n_1 from score. (This last step is only necessary for the 2022 dataset, but I will do it for all datasets as a precaution.)
Create for every score column a new factor column with five bins: Use as bin names and limits the classification of the press freedom map as outlined in the methodology article for 2022 onwards. Name each new column after its original score column with the suffix _situation. Move each newly created column from the end of the data frame to the position immediately after the score column it assesses.
Change all columns with type character into columns of type factor
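The score repair sequence and the binning can be sketched as follows. This is not the my_rwb_rec() helper from R/helper.R, which is not shown in this chapter; the function names fix_score and to_situation are hypothetical. The bin limits (40, 55, 70, 85) and the English labels are my reading of the RSF classification from 2022 onwards and should be checked against the methodology article.

```r
# Sketch of the four numeric repair steps; the order matters
fix_score <- function(x) {
  x <- ifelse(!is.na(x) & x < 100,   x * 100, x)  # two digits: two trailing zeros lost
  x <- ifelse(!is.na(x) & x < 1000,  x * 10,  x)  # three digits: one trailing zero lost
  x <- ifelse(!is.na(x) & x > 10000, x / 10,  x)  # five digits: extra tie-break digit
  x / 100                                         # restore the two decimal places
}

# ASSUMED class limits and labels; lower limit included, 100 included in the top bin
to_situation <- function(score) {
  cut(score,
      breaks = c(0, 40, 55, 70, 85, 100),
      labels = c("Very serious", "Difficult", "Problematic",
                 "Satisfactory", "Good"),
      right = FALSE, include.lowest = TRUE)
}

raw <- c(9265, 883, 90, 65487, NA)   # values taken from the listings above
clean <- fix_score(raw)
clean
to_situation(clean)
```

Note that the heuristic cannot tell a genuine score below 10 from a score that has lost trailing zeros: a raw value of 95 is always read as 95.00, never as 9.50.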

R Code 1.7 : Clean data values for the datasets 2022-2025

Run this code chunk manually if rwb1 still needs to be created and saved

Listing / Output 1.7: Follow the procedure of Procedure 1.1 and clean values for the RWB datasets 2022-2025

Code

base::source(file = "R/helper.R")

## load recoded rsf datasets into memory
my_get_dir_files("data/chap011/rsf_rec", "\\.rds$")


df_list1 = list(rsf2025_rec, rsf2024_rec, rsf2023_rec, rsf2022_rec)
lapply(df_list1, my_rwb_rec)

rwb1 <- dplyr::bind_rows(rwb2022, rwb2023, rwb2024, rwb2025) |>
    dplyr::mutate(dplyr::across(dplyr::where(is.character), as.factor)) |>
    dplyr::arrange(desc(year_n), country_en)


########## save file
my_save_data_file("chap011/rwb", rwb1, "rwb1.rds")

(No output is available for this R code chunk.)

1.2.2 Batch 2: 2013-2021

1.2.2.1 Compare structure

We are now going to inspect the second batch of datasets: the data from the years 2013-2021.

R Code 1.8 : Compare structure of selected datasets between 2013-2021

Listing / Output 1.8: Compare the structure of selected datasets 2013-2021 by using the janitor::compare_df_cols() function

Code

janitor::compare_df_cols(rsf2021, rsf2018, rsf2016, rsf2013)

#>                      column_name   rsf2021   rsf2018   rsf2016   rsf2013
#> 1                     AR_country character character character character
#> 2                     EN_country character character character character
#> 3                     ES_country character character character character
#> 4                     FA_country character character character character
#> 5                     FR_country character character character character
#> 6                            ISO character character character character
#> 7                 Rank evolution   numeric   numeric   numeric   numeric
#> 8                         Rank N   numeric   numeric   numeric   numeric
#> 9                       Rank N-1   numeric   numeric   numeric   numeric
#> 10               Score exactions   numeric   numeric   numeric   logical
#> 11                       Score N   numeric   numeric   numeric   numeric
#> 12    Score N with the exactions   numeric   numeric   logical   logical
#> 13 Score N without the exactions   numeric   numeric   numeric   logical
#> 14                     Score N-1   numeric   numeric   numeric   numeric
#> 15                      Year (N)   numeric   numeric   numeric   numeric
#> 16                          Zone character character character character

The data frames from the years 2013 to 2021 are quite different from those of the first batch. They have only 16 columns, because they contain only the global score; all the context variables and the safety column are missing. Although the questionnaire has used completely different indicators from 2022 onwards, one can compare the countries over the years by their global scores (see the methodological considerations in Important 1.1). Nothing needs to be cleaned up in this respect: the context variables are only available in the first batch of datasets.

1.2.2.2 Clean column structure

Procedure 1.2 : Clean column structure for the RWB datasets 2013-2021

To clean up the column structure of the datasets 2013-2021, three actions are necessary:

With Score exactions, Score N with the exactions and Score N without the exactions there are three columns that are not present in the first batch of datasets (2022-2025). There is also a column type mismatch of numeric versus logical in these three exactions columns, because in the first years of the second batch they contain only missing (NA) values. In any case, I couldn’t find an explanation of what the exactions columns measure. So I will delete these columns.
In the first batch the two-letter language code appears at the end of the country name columns (Country_EN); in the second batch (2013-2021) it appears at the start of the column name (EN_country). This matters only for the English names, as I will use only the English variant of the country names. Other issues (such as the score value inconsistency) are the same in both dataset batches.
The column names for the global score and rank are Score N and Rank N instead of just Score and Rank as in the first batch of datasets. I have to rename them.

Additionally there is another issue: The Score evolution column is missing. This column is important and easy to compute because Score N and Score N-1 are present. But this change has to be done after cleaning the Score N and Score N-1 values.

Important 1.2: The column sequence for row binding is irrelevant. Important is only the match of column names.
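A small sketch illustrates this with two toy data frames whose columns are in a different order and only partly overlap. dplyr::bind_rows() matches columns by name and fills columns missing in one input with NA. The values are taken from the 2022 listing above; the object names are only illustrative.

```r
# Same column names in a different order; b has an extra column
a <- dplyr::tibble(iso = "NOR", rank = 1)
b <- dplyr::tibble(rank = 2, iso = "DNK", score = 90.27)

# Columns are matched by name; the missing score for NOR becomes NA
bound <- dplyr::bind_rows(a, b)
bound
```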

R Code 1.9 : Clean column structure 2013-2021

Run this code chunk manually if the recoded file(s) still needs to be created

Listing / Output 1.9: Clean / reorganize column structure for row binding

Code

base::source(file = "R/helper.R")
save_path = "chap011/rsf_rec"
save_ext = "_rec2.rds"
load_path = paste0(here::here(), "/data/chap011/rsf/")

## load rsf datasets into memory
my_get_dir_files("data/chap011/rsf", "\\.rds$")

rsf_batch2 <- function(df, year, path, ext) {
    df <-  df |> 
        dplyr::select(-contains("exactions")) |>        # (1) 
        dplyr::rename(
            Country_EN = EN_country,                    # (2)                    
            Score = `Score N`,                          # (3)
            Rank = `Rank N`                             # (3)
        ) |> 
        dplyr::relocate(Country_EN, .after = ISO) |> 
        janitor::clean_names() |>
        dplyr::select(-c(fr_country:fa_country))        # (2)
    my_save_data_file(path, df, paste0("rsf", year, ext))
}

get_rsf_recoded2 <- function(years, path) {
  for (i in 1:length(years)) {
    my_name <- paste0(path, "rsf", years[i], ".rds")
    file_name <- basename(my_name)
    rsf_batch2(assign(
      file_name, readRDS(my_name)),
      years[i],
      save_path,
      save_ext)
  }
}

get_rsf_recoded2(2013:2021, load_path)

(No output is available for this R code chunk.)

1.2.2.3 Clean values

R Code 1.10 : Clean data values for the datasets 2013-2021

Run this code chunk manually if rwb2 still needs to be created and saved

Listing / Output 1.10: Follow the procedure of Procedure 1.2 and clean values for the RWB datasets 2013-2021

Code

base::source(file = "R/helper.R")

## load recoded rsf datasets into memory
my_get_dir_files("data/chap011/rsf_rec", "\\.rds$")


########### clean data second batch  
df_list2 = list(rsf2021_rec2, rsf2020_rec2, rsf2019_rec2, rsf2018_rec2,
                rsf2017_rec2, rsf2016_rec2, rsf2015_rec2, rsf2014_rec2,
                rsf2013_rec2)
lapply(df_list2, my_rwb_rec)

########### bind rows 
rwb2 <- dplyr::bind_rows(rwb2021, rwb2020, rwb2019, rwb2018,
                rwb2017, rwb2016, rwb2015, rwb2014, rwb2013) |>
    dplyr::mutate(dplyr::across(dplyr::where(is.character), as.factor)) |>
    dplyr::arrange(desc(year_n), country_en)

############# save file
my_save_data_file("chap011/rwb", rwb2, "rwb2.rds")

(No output is available for this R code chunk.)

1.2.3 Batch 3: 2002-2011/2012

1.2.3.1 Compare structure

The third batch is the easiest to clean up, because the Score N and Score N-1 values are not comparable with those of the other dataset batches. The scores used for the years 2002-2012 (2011 missing) range from 0 (or -10 in 2011/2012) for the best situation to a maximum of 115.5 (in 2009) for the worst situation. Therefore we also don’t need to create and compute the missing Score evolution column.

R Code 1.11 : Compare the structure of a selection of the datasets 2002-2012

Listing / Output 1.11: Compare the structure of a selection of the datasets 2002-2012 by using the janitor::compare_df_cols() function

Code

janitor::compare_df_cols(rsf2012, rsf2008, rsf2005, rsf2002)

#>                      column_name   rsf2012   rsf2008   rsf2005   rsf2002
#> 1                     AR_country character character character character
#> 2                     EN_country character character character character
#> 3                     ES_country character character character character
#> 4                     FA_country character character character character
#> 5                     FR_country character character character character
#> 6                            ISO character character character character
#> 7                 Rank evolution   numeric   numeric   numeric   logical
#> 8                         Rank N   numeric   numeric   numeric   numeric
#> 9                       Rank N-1   numeric   numeric   numeric   logical
#> 10               Score exactions   logical   logical   logical   logical
#> 11                       Score N   numeric   numeric character character
#> 12    Score N with the exactions   logical   logical   logical   logical
#> 13 Score N without the exactions   logical   logical   logical   logical
#> 14                     Score N-1 character character character   logical
#> 15                      Year (N) character   numeric   numeric   numeric
#> 16                          Zone character character character character

1.2.3.2 Clean column structure

It turned out that the data frames from 2002-2012 have exactly the same structure as the datasets from 2013-2021. But there is one big difference: The values of the columns of Score N and Score N-1 are not compatible with the rest of the data. So when I am going to bind the rows of the different years together, I have to delete these columns to prevent misunderstandings. For the years 2002-2012 only the rank data can be used.

Procedure 1.3 : Clean structure for the RWB dataset 2002-2012

The following steps are necessary to clean up the structure of the 2002-2012 datasets:

Delete all columns that contain Score. These are:

Score N
Score N-1
Score N without the exactions
Score N with the exactions and
Score exactions

As in the second batch of datasets, rename EN_country to Country_EN.
Rename Rank N to Rank.
Delete all languages for country names with the exception of the English names.
Skip the missing year 2011

R Code 1.12 : Clean column structure 2002-2012

Run this code chunk manually if the recoded file(s) still needs to be created

Listing / Output 1.12: Clean / reorganize column structure for row binding

Code

base::source(file = "R/helper.R")
save_path = "chap011/rsf_rec"
save_ext = "_rec3.rds"
load_path = paste0(here::here(), "/data/chap011/rsf/")

## load rsf datasets into memory
my_get_dir_files("data/chap011/rsf", "\\.rds$")

rsf_batch3 <- function(df, year, path, ext) {
    df <-  df |> 
        dplyr::select(-contains("Score")) |>            # (1) 
        dplyr::rename(
            Country_EN = EN_country,                    # (2)                    
            Rank = `Rank N`                             # (3)
        ) |> 
        dplyr::relocate(Country_EN, .after = ISO) |> 
        janitor::clean_names() |>
        dplyr::select(-c(fr_country:fa_country))        # (4)
    my_save_data_file(path, df, paste0("rsf", year, ext))
}

get_rsf_recoded3 <- function(years, path) {
  for (i in 1:length(years)) {
    if (years[i] == 2011) {next}                        # (5)
    my_name <- paste0(path, "rsf", years[i], ".rds")
    file_name <- basename(my_name)
    rsf_batch3(assign(
      file_name, readRDS(my_name)),
      years[i],
      save_path,
      save_ext)
  }
}

get_rsf_recoded3(2002:2012, load_path)

(No output is available for this R code chunk.)

1.2.3.3 Clean values

When I tried to clean up the values in Listing / Output 1.13 I noticed another structural problem: because data for the year 2011 are missing, the 2012 dataset has the character string 2011-12 as its year value and is therefore not compatible with the other datasets.

Procedure 1.4 : Clean values for the RWB datasets 2002-2012

Change the year_n values of the dataset 2012 from the character string 2011-12 to the numeric value of 2012
Change all columns of type character to columns of type factor.
Sort the data by year (year_n) and country name (country_en)

I applied the last two changes also to the other batches of datasets (batch 1 and 2) but didn’t mention it explicitly in the appropriate sections.

R Code 1.13 : Clean data values for the datasets 2002-2012

Run this code chunk manually if rwb3 still needs to be created and saved

Listing / Output 1.13: Follow the procedure of Procedure 1.4 and clean values for the RWB datasets 2002-2012

Code

base::source(file = "R/helper.R")

## load recoded rsf dataset into memory
my_get_dir_files("data/chap011/rsf_rec", "\\.rds$")

rwb_rec3 <- function(df) {
    ## only the 2012 dataset has a character year ("2011-12")
    if (is.character(df$year_n)) {                          # (1)
        df <- dplyr::mutate(df, year_n = 2012)              # (1)
    }
    ## store the data frame as rwb<year> in the global environment
    file_name <- paste0("rwb", df$year_n[1])
    assign(file_name, df, envir = globalenv())
}


########### clean data third batch
df_list3 = list(rsf2012_rec3, rsf2010_rec3, rsf2009_rec3,
                rsf2008_rec3, rsf2007_rec3, rsf2006_rec3, rsf2005_rec3,
                rsf2004_rec3, rsf2003_rec3, rsf2002_rec3)
lapply(df_list3, rwb_rec3)

########## bind rows
rwb3 <- dplyr::bind_rows(rwb2012, rwb2010, rwb2009, rwb2008, rwb2007, 
                         rwb2006, rwb2005, rwb2004, rwb2003, rwb2002) |>
    dplyr::mutate(dplyr::across(dplyr::where(is.character), as.factor)) |> # (2)
    dplyr::arrange(desc(year_n), country_en)                               # (3)

########## save file
my_save_data_file("chap011/rwb", rwb3, "rwb3.rds")

(No output is available for this R code chunk.)

1.2.4 All together

Now that I have cleaned the three different dataset batches, it is time to combine them into the one dataset I am going to work with.

R Code 1.14 : Bind the rows for the three cleaned batches of datasets rwb1.rds, rwb2.rds and rwb3.rds

Listing / Output 1.14: Bind the rows for the three cleaned batches of datasets rwb1.rds, rwb2.rds and rwb3.rds into the final rwb.rds file.

Code

base::source(file = "R/helper.R")

######### load cleaned batches of datasets into memory
my_get_dir_files("data/chap011/rwb", "\\.rds$")


########## bind rows
rwb <- (dplyr::bind_rows(rwb1, rwb2, rwb3))


########## save file
my_save_data_file("chap011/rwb", rwb, "rwb.rds")
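Before relying on the combined file, a few sanity checks may be worthwhile. The sketch below uses an invented toy data frame in place of rwb. The checks themselves (expected years, no duplicated year/country pairs, countries per year) are a suggestion and not part of the workflow above.

```r
# Toy stand-in for the combined dataset (values are invented)
toy_rwb <- data.frame(
    year_n     = c(2013, 2013, 2012, 2012, 2010, 2010),
    country_en = c("Finland", "Norway", "Finland", "Norway", "Finland", "Norway")
)

# Years that should be present: here 2010 to 2013 without the missing 2011
expected_years <- setdiff(2010:2013, 2011)
years_ok <- setequal(unique(toy_rwb$year_n), expected_years)

# Every country should appear at most once per year
n_dupes <- sum(duplicated(toy_rwb[, c("year_n", "country_en")]))

# Number of countries per year, useful to spot incomplete yearly files
countries_per_year <- table(toy_rwb$year_n)
```

For the real data the expected years would presumably be `setdiff(2002:2025, 2011)`, because the 2011 index is missing from the downloads in Section 1.1.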

1.3 Get M49

To harmonize the RWB datasets with the names of the planned country geometries for the maps I need to download an official classification system. The United Nations Statistics Division (UNSD) provides a detailed classification system designed expressly for statistical purposes, using the M49 methodology.

M49 is officially called Standard country or area codes for statistical use (M49) and can be downloaded manually from the Overview page in different languages and formats (copy to the clipboard, Excel or CSV). The Overview page offers no URL that an R script could use, because the buttons copy or download the data with the help of JavaScript. So I had to either download the file manually or find another location from which I could download it programmatically.

The M49 specification is included in the {ISOcodes} package. But I am using the official file because it lists countries and regions together in a form where no major recoding is necessary.

With the OMNIKA DataStore United Nations M49 Region Codes I found an external source for the UNSD-M49 country classification. As a safety check I compared the two files (the official download and the OMNIKA copy) with base::all.equal() to determine whether they are identical. Yes, they are!
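Such a comparison can be sketched as follows. The two temporary files and their content are invented for illustration and stand in for the official and the OMNIKA file; the checksum comparison with tools::md5sum() is an additional check that I assume to be useful, not something the comparison above necessarily included.

```r
# Two temporary CSV files with identical, invented content
toy <- data.frame(m49_code = c("004", "008"),
                  country  = c("Afghanistan", "Albania"))
file_a <- tempfile(fileext = ".csv")
file_b <- tempfile(fileext = ".csv")
write.csv(toy, file_a, row.names = FALSE)
write.csv(toy, file_b, row.names = FALSE)

# Byte-level comparison: identical files have identical checksums
same_bytes <- unname(tools::md5sum(file_a) == tools::md5sum(file_b))

# Content-level comparison: parse both files and compare the data frames.
# all.equal() returns TRUE or a character vector describing the differences,
# so the result is wrapped in isTRUE() to get a plain logical value.
df_a <- read.csv(file_a, colClasses = "character")
df_b <- read.csv(file_b, colClasses = "character")
same_content <- isTRUE(all.equal(df_a, df_b))
```

The checksum test is stricter than the content test: two files can contain the same table but differ in line endings or quoting, in which case only `same_content` would be TRUE.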

The UNSD M49 standard area codes are stored as Excel and CSV files. For reproducibility I download the CSV file.

R Code 1.15 : Get M49 classification system

Run this code chunk manually if the file(s) still needs to be downloaded.

Listing / Output 1.15: Download and store the M49 classification system

Code

## download unsd-m49 file ############
url_m49 <- "https://github.com/omnika-datastore/unsd-m49-standard-area-codes/raw/refs/heads/main/2022-09-24__CSV_UNSD_M49.csv"

downloader::download(
    url = url_m49,
    destfile = "data/m49.csv"
)


## create R object ###############
m49_raw <-
    readr::read_delim(
        file = "data/m49.csv",
        delim = ";", 
        escape_double = FALSE, 
        trim_ws = TRUE,
        show_col_types = FALSE
    )

my_save_data_file("chap011", m49_raw, "m49_raw.rds")

(For this R code chunk is no output available)
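Instead of running the download chunk manually, the download could be guarded by a check for the target file. The following is a minimal sketch under that assumption: `fetch_if_missing()` is a hypothetical helper, not part of R/helper.R, and it uses utils::download.file() as the default downloader rather than downloader::download().

```r
# Hypothetical helper: download `url` to `destfile` only if the file is missing.
# `fetch` is the function that performs the download; it defaults to
# utils::download.file() and can be replaced, e.g. for offline testing.
fetch_if_missing <- function(url, destfile, fetch = utils::download.file) {
    if (file.exists(destfile)) {
        return(invisible(FALSE))   # nothing to do, the file is already there
    }
    # make sure the target directory exists before downloading
    dir.create(dirname(destfile), recursive = TRUE, showWarnings = FALSE)
    fetch(url, destfile)
    invisible(TRUE)                # a download was attempted
}
```

With the objects from Listing / Output 1.15, a call could look like `fetch_if_missing(url_m49, "data/m49.csv")`. Note that the sketch only checks whether the file exists, not whether an earlier download was complete.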

Glossary Entries

term	definition
CSV	Text files where the values are separated with commas (Comma Separated Values = CSV). These files have the file extension .csv
M49	The United Nations publication "Standard Country or Area Codes for Statistical Use" was originally published as Series M, No. 49 and is now commonly referred to as the M49 standard. M49 is a country/areas classification system prepared by the Statistics Division of the United Nations Secretariat primarily for use in its publications and databases.
OMNIKA	OMNIKA DataStore is an open-access data science resource for researchers, authors, and technologists. OMNIKA Foundation is an American 501(c)(3) nonprofit organization that operates a digital mythological library. Almost every culture has relevant mythology that explains where we came from, why things are the way they are, and a number of other things. OMNIKA's goal is to collect, organize, index, and quantify all of those data in one place and make them available for free. (https://omnika.org/info/about)
RWB	Reporters Without Borders (RWB), known by its French name Reporters sans frontières and acronym RSF, is an international non-profit and non-governmental organization headquartered in Paris, France, founded in 1985 in Montpellier by journalists Robert Ménard, Rémy Loury, Jacques Molénat, and Émilien Jubineau. It is dedicated to safeguarding the right to freedom of information and defends journalists and media personnel who are imprisoned, persecuted, or at risk for their work. The organization has consultative status at the United Nations, UNESCO, the Council of Europe, and the International Organisation of the Francophonie.
UNSD	The United Nations Statistics Division (UNSD) is committed to the advancement of the global statistical system. It compiles and disseminates global statistical information, develops standards and norms for statistical activities, and supports countries' efforts to strengthen their national statistical systems.
UTF-8	UTF-8 is a character encoding system that uses between one and four eight-bit bytes to represent all valid Unicode code points. It is designed to be backward compatible with ASCII, meaning that the first 128 UTF-8 characters are identical to the ASCII characters numbered 0-127. UTF-8 has become the de facto standard character encoding for the internet and related document types, with 97.9% of websites using it by April 2023. (Brave AI)

Session Info

Code

xfun::session_info()

#> R version 4.5.1 (2025-06-13)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Sequoia 15.6.1
#> 
#> Locale: en_US.UTF-8 / en_US.UTF-8 / en_US.UTF-8 / C / en_US.UTF-8 / en_US.UTF-8
#> 
#> Package version:
#>   askpass_1.2.1       base64enc_0.1.3     bslib_0.9.0        
#>   cachem_1.1.0        cli_3.6.5           commonmark_2.0.0   
#>   compiler_4.5.1      cpp11_0.5.2         curl_7.0.0         
#>   dichromat_2.0-0.1   digest_0.6.37       dplyr_1.1.4        
#>   evaluate_1.0.5      farver_2.1.2        fastmap_1.2.0      
#>   fontawesome_0.5.3   fs_1.6.6            generics_0.1.4     
#>   glossary_1.0.0.9003 glue_1.8.0          graphics_4.5.1     
#>   grDevices_4.5.1     grid_4.5.0          here_1.0.1         
#>   highr_0.11          hms_1.1.3           htmltools_0.5.8.1  
#>   htmlwidgets_1.6.4   httr_1.4.7          janitor_2.2.1      
#>   jquerylib_0.1.4     jsonlite_2.0.0      kableExtra_1.4.0   
#>   knitr_1.50          labeling_0.4.3      lifecycle_1.0.4    
#>   litedown_0.7        lubridate_1.9.4     magrittr_2.0.4     
#>   markdown_2.0        memoise_2.0.1       methods_4.5.1      
#>   mime_0.13           openssl_2.3.3       pillar_1.11.0      
#>   pkgconfig_2.0.3     purrr_1.1.0         R6_2.6.1           
#>   rappdirs_0.3.3      RColorBrewer_1.1-3  rlang_1.1.6        
#>   rmarkdown_2.29      rprojroot_2.1.1     rstudioapi_0.17.1  
#>   rversions_2.1.2     rvest_1.0.5         sass_0.4.10        
#>   scales_1.4.0        selectr_0.4.2       snakecase_0.11.1   
#>   stats_4.5.1         stringi_1.8.7       stringr_1.5.2      
#>   svglite_2.2.1       sys_3.4.3           systemfonts_1.2.3  
#>   textshaping_1.0.1   tibble_3.3.0        tidyr_1.3.1        
#>   tidyselect_1.2.1    timechange_0.3.0    tinytex_0.57       
#>   tools_4.5.1         utf8_1.2.6          utils_4.5.1        
#>   vctrs_0.6.5         viridisLite_0.4.2   withr_3.0.2        
#>   xfun_0.53           xml2_1.4.0          yaml_2.3.10