Get the UNSD classification system using the M49 methodology. (Section 1.3)
1.1 Get RWB data
The first task is to download all datasets from the RWB website.
There is no place where I can get an integrated, harmonized dataset. The only way I found is to open each year’s index and use the “Download this index” button. Fortunately, the link stem and the structure of the file name are the same for all years.
The side effect of Listing / Output 1.1 is a collection of data files from 2002 to 2025 (2011 missing) with the structure of rsf<year>.rds.
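Because only the year changes in the download link, the URLs can be generated instead of typed. The following is a minimal sketch, assuming that the link stem of the 2025 file (https://rsf.org/sites/default/files/import_classement/2025.csv) holds for every year. `rsf_url()`, `rsf_stem` and `rsf_years` are hypothetical names, and nothing is downloaded here.

```r
# Link stem of the yearly CSV files (taken from the 2025 download link).
rsf_stem <- "https://rsf.org/sites/default/files/import_classement/"

# Hypothetical helper: build the download URL for one or more years.
rsf_url <- function(year) {
  paste0(rsf_stem, year, ".csv")
}

# All index years from 2002 to 2025; 2011 is missing.
rsf_years <- setdiff(2002:2025, 2011)
rsf_urls <- rsf_url(rsf_years)

head(rsf_urls, 2)
```

Each element of `rsf_urls` could then be passed to a reader for semicolon-separated files.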
1.2 Clean RWB data
A manual inspection of the RSF website and of the downloaded datasets reveals three different structures of RWB datasets. My aim is to create one dataset for the whole time range 2002-2025. Therefore I have to compare the structures of the different datasets.
The first step is to load all datasets into the R memory:
R Code 1.2 : Load all datasets into R memory
Run this code chunk manually if the file(s) still needs to be loaded into memory
Listing / Output 1.2: Load all saved RWB datasets (2002-2025, 2011 missing) into R memory
(No output is available for this R code chunk. The function has the side effect that all files with the specified path and file extension are loaded into R memory.)
Line 06: The Portuguese country names are missing in 2022. This is not relevant for my use case, as I will only use Country_EN.
Line 20 and 21: The column Score 2025 in the 2025 dataset has to be renamed to Score so that it has a general name that matches the other datasets.
Line 22 and 23: Score evolution and the score of the previous year (Score N-1) are missing in the 2022 data. The reason is that in 2021 the score calculation followed a different methodology.
Line 24: Only the 2024 dataset has a column that judges the absolute score values with five predicates (in French): Bonne situation, Situation plutôt bonne, Situation problématique, Situation difficile, Situation très grave. The values for the classification changed in 2022 and are documented (in English) in the already mentioned article on Methodology 2022-2025.
Important 1.1: Methodological considerations
Even if the years 2022-2025 and 2013-2021 have the same scale (0-100 points) these scales measure different factors of press freedom:
2022-2025: global score, political context, economic context, legal context, social context and safety.
2013-2021: global score, pluralism, media independence, environment and self-censorship, legislative framework, transparency, infrastructure, abuses. (Only the global score is reported.)
Comparing ranks across these two measurement methods poses no problem. A comparison of the score values, however, is more delicate. To understand why, imagine comparing cars: in one period we measure their maximum speed, in the other their fuel economy. Even if we rate both measures on a scale of 0-100, they are not comparable.
In our case, however, we have not measured different outcomes or products; we have used different indicators to measure the same thing, namely press freedom. I therefore believe that it is feasible to compare the global scores of the different measurements. To stay with the car example: in one period we measure fuel economy on the highway at a speed of 100 km/h, in the other period we use a combined measure of driving in the city, on country roads and on highways. Both methods give us a measure of fuel economy, even if the components of the global score differ.
As the RWB methods in both periods used the same scale (0 for the worst, 100 for the best), I think it is legitimate to compute the difference between 2022 (the first year with the new method) and 2021 (the last year with the previous method). This reflection concerns the two missing columns in 2022: Score evolution and Score N-1. To bind the rows together we need to add these two columns with the appropriate values to 2022.
In graphs comparing global score values for years with different methods it would be helpful to signal this distinction with different colors, line types etc. in the same figure.
1.2.1.2 Adapt 2022 dataset
R Code 1.4 : Add Score N-1 and Score evolution to RWB dataset 2022
Run this code chunk manually if the 2022 dataset still needs to be adapted
Listing / Output 1.4: Add Score N-1 from 2021 and compute Score evolution to RWB dataset 2022
Code
rsf2021_short <- readRDS(paste0(here::here(), "/data/chap011/rsf/rsf2021.rds")) |>
  dplyr::select(c(ISO, `Score N`)) |>
  dplyr::rename(`Score N-1` = `Score N`)

rsf2022 <- readRDS(paste0(here::here(), "/data/chap011/rsf/rsf2022.rds"))

## check if columns for rsf2022 are already present
if (!"Score evolution" %in% names(rsf2022)) {
  rsf2022 <- rsf2022 |>
    dplyr::left_join(rsf2021_short, "ISO") |>
    dplyr::mutate(`Score evolution` = Score - `Score N-1`)
  my_save_data_file("chap011/rsf", rsf2022, "rsf2022.rds")
}

rsf2022
A check of the computed values for Score evolution shows inconsistent values. Compare, for instance, the value of the Score column for Ireland (ISO = “IRL”) in the 2022 dataset with the rest of the first ten rows. This problem exists in all score columns, as one can see in the political context score for Finland (FIN).
A further detailed examination revealed that:
The values are not within the range 0-100 because the decimal point is missing. For instance, the Score value of Norway (NOR) for 2022 is 9265 instead of 92.65.
Trailing zeros are dropped. The value for Ireland is 883 instead of 8830, which corresponds to 88.30 on the scale of 0-100. There are also some values with only two digits, representing values with two trailing zeros.
A thorough review of other years showed that some years distinguish ties by appending another digit as an additional (silent) decimal place. Instead of 65.487 for the USA and 65.486 for Gambia in 2025 we have 65487 and 65486.
1.2.1.3 Compare column types
Before we can bind the datasets 2022-2025 together we have to solve another issue. A glimpse at the data shows that for the years 2023-2025 Score evolution is of type character with decimal commas (e.g., "0,63") instead of type double with a decimal point (e.g., 0.63) as used in R. To bind the rows of different years together, we have to match not only the column names but also the column types.
R Code 1.5 : Inspect the column types for the RWB dataset 2025
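The conversion from decimal-comma text to numbers can be sketched in base R as follows. Only "0,63" comes from the data; the other example values and the variable names are made up for illustration.

```r
# Example values in the style of the 2023-2025 files: decimal comma, stored as text.
# Only "0,63" is taken from the data; the rest is invented for illustration.
score_evolution_chr <- c("0,63", "-1,2", "3", NA)

# Swap the decimal comma for a decimal point, then convert to double.
score_evolution_num <- as.double(sub(",", ".", score_evolution_chr, fixed = TRUE))

score_evolution_num
typeof(score_evolution_num)
```

Missing values stay missing, so the conversion does not need special handling for `NA`.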
Listing / Output 1.5: Column types of original RWB dataset 2025
Listing / Output 1.5 shows the problem with Score evolution. The inspection also shows that we need to convert all character columns to columns of type factor. But we can do this later, when we bind the rows together.
Another issue to clean up is that some country names, even in the English version, are not UTF-8 encoded. This problem only concerns the 2025 dataset:
“C�te d’Ivoire” instead of “Côte d’Ivoire” in Country_EN
“T�rkiye” instead of “Türkiye” in Country_EN and
“Am�riques” instead of “Amériques” in Zone.
On this occasion we also learn that the regional names in Zone are in French, not in English, for all years.
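One possible way to detect and repair such names is to re-encode them. This is only a sketch under two assumptions that the data do not confirm: that the 2025 file was written in Latin-1, and that the original bytes are still present (if the reader has already substituted a replacement character, re-encoding cannot recover the letter, and a pattern-based replacement is needed instead). The variable names are hypothetical.

```r
# Build "Côte" as Latin-1 bytes: 0xF4 is "ô" in Latin-1 but is not valid UTF-8.
raw_name <- rawToChar(as.raw(c(0x43, 0xf4, 0x74, 0x65)))
validUTF8(raw_name)

# Re-encode from the assumed source encoding (Latin-1) to UTF-8.
fixed_name <- iconv(raw_name, from = "latin1", to = "UTF-8")
validUTF8(fixed_name)
fixed_name
```

`validUTF8()` can also be applied to a whole column to find the affected rows before deciding how to repair them.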
1.2.1.4 Clean column structure
R Code 1.6 : Clean column structure 2022-2025
Run this code chunk manually if the recoded file(s) still needs to be created
Listing / Output 1.6: Clean and reorganize column structure for row binding
We are now in a position to clean the values and bind the datasets 2022-2025 together by rows. On this occasion we will create a function, because we need most of the code for the other two batches as well.
Procedure 1.1 : Clean values for the datasets 2022-2025
Let’s summarize what we want to clean up:
1. Bind the rows of the datasets together
2. Update all score figures: This includes the (global) score but also the political_context, the economic_context, the legal_context, the social_context, the safety and the score_n_1 columns. The update has to be done in a sequence of 5 steps. The correct sequence is important!
   - Multiply all scores smaller than 100 by 100
   - Multiply all scores smaller than 1000 by 10
   - Divide all scores bigger than 10000 by 10
   - Divide the new numbers by 100 to get the correct decimal scores
   - Update score_evolution by subtracting score_n_1 from score. (This last step is only necessary for the 2022 dataset, but I will do it for all datasets as a precaution.)
3. Create for every score column a new factor column with five bins: Use as bin names and limits the classification of the press freedom map as outlined in the methodology article for 2022 onwards. Name each new column by adding _situation to the name of its original score column. Move the newly created columns from their last place to the place immediately after the score column they assess.
4. Change all columns of type character into columns of type factor
R Code 1.7 : Clean data values for the datasets 2022-2025
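The score repair and the binning can be sketched in base R as follows. This is not the helper function used in the chapter (that lives in R/helper.R and is not shown here); `fix_score()` and `score_situation()` are hypothetical names. The bin limits (40, 55, 70, 85) and the English labels are my reading of the RSF methodology article and are an assumption that should be checked against it.

```r
# Hypothetical helper: repair raw score values that lost their decimal point.
# The order of the steps matters (see the note below the code).
fix_score <- function(x) {
  x <- ifelse(x < 100, x * 100, x)    # two trailing zeros dropped: 92 -> 9200
  x <- ifelse(x < 1000, x * 10, x)    # one trailing zero dropped: 883 -> 8830
  x <- ifelse(x > 10000, x / 10, x)   # extra tie-break digit: 65487 -> 6548.7
  x / 100                             # restore the decimal point: 9265 -> 92.65
}

# Hypothetical helper: bin cleaned scores into five situation classes.
# ASSUMPTION: limits and labels follow the methodology article for 2022 onwards.
score_situation <- function(score) {
  cut(
    score,
    breaks = c(0, 40, 55, 70, 85, 100),
    labels = c("very serious", "difficult", "problematic", "satisfactory", "good"),
    right = FALSE,          # intervals are closed on the left: [70, 85)
    include.lowest = TRUE   # ... except the last one, which includes 100
  )
}

raw_scores <- c(9265, 883, 92, 65487)
clean_scores <- fix_score(raw_scores)
clean_scores
score_situation(clean_scores)
```

The sequence is important because the first step must run before the second: a two-digit value such as 92 would otherwise be multiplied by 10 only, pass the later checks unchanged and end up as 9.2 instead of 92.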
Run this code chunk manually if rwb1 still needs to be created and saved
Listing / Output 1.7: Follow the procedure of Procedure 1.1 and clean values for the RWB datasets 2022-2025
#>                       column_name   rsf2021   rsf2018   rsf2016   rsf2013
#>  1                    AR_country character character character character
#>  2                    EN_country character character character character
#>  3                    ES_country character character character character
#>  4                    FA_country character character character character
#>  5                    FR_country character character character character
#>  6                           ISO character character character character
#>  7                Rank evolution   numeric   numeric   numeric   numeric
#>  8                        Rank N   numeric   numeric   numeric   numeric
#>  9                      Rank N-1   numeric   numeric   numeric   numeric
#> 10               Score exactions   numeric   numeric   numeric   logical
#> 11                       Score N   numeric   numeric   numeric   numeric
#> 12    Score N with the exactions   numeric   numeric   logical   logical
#> 13 Score N without the exactions   numeric   numeric   numeric   logical
#> 14                     Score N-1   numeric   numeric   numeric   numeric
#> 15                      Year (N)   numeric   numeric   numeric   numeric
#> 16                          Zone character character character character
The data frames from the years 2013 to 2021 are quite different. They have only 16 columns, because they contain only the global score. All the context variables and the safety column are missing. Although the questionnaire has used completely different indicators from 2022 onwards, one can compare the countries over the years by their global scores (see the methodological considerations in Important 1.1). There is nothing to clean up here: the context variables are only available in the first batch of datasets.
1.2.2.2 Clean column structure
Procedure 1.2 : Clean column structure for the RWB datasets 2013-2021
To clean up the column structure of the datasets 2013-2021, three actions are necessary:
1. Three columns (Score exactions, Score N with the exactions and Score N without the exactions) are not present in the first batch of datasets (2022-2025). There is also a column type mismatch of numeric versus logical in these three exactions columns, because in the first years of the second batch some of them contain only missing (NA) values. In any case, I couldn’t find an explanation of what the exactions columns measure, so I will delete them.
2. In the first batch the two-letter language code stands at the end of the country name columns (Country_EN); in the second batch (2013-2021) it appears at the start of the column name (EN_country). This only matters for the English names, as I will use only the English variant of the country names. Other issues (such as the score value inconsistency) are the same in both dataset batches.
3. The column names for the global score and rank are Score N and Rank N instead of just Score and Rank as in the first batch of datasets. I have to rename them.
Additionally, there is another issue: The Score evolution column is missing. This column is important and easy to compute because Score N and Score N-1 are present. But this change has to be made after cleaning the Score N and Score N-1 values.
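The structural changes described above can be sketched in base R on a toy data frame. This is an illustration, not the chapter's own code: the data values are invented and already on the 0-100 scale, and `rsf_toy` is a hypothetical object.

```r
# Toy stand-in for one 2013-2021 file (values invented for illustration).
rsf_toy <- data.frame(
  ISO = c("NOR", "IRL"),
  EN_country = c("Norway", "Ireland"),
  `Rank N` = c(1, 12),
  `Score N` = c(93.28, 87.09),
  `Score N-1` = c(92.16, 87.40),
  `Score exactions` = c(0, 0),
  check.names = FALSE
)

# 1. Drop every exactions column.
keep <- !grepl("exactions", names(rsf_toy), fixed = TRUE)
rsf_toy <- rsf_toy[, keep, drop = FALSE]

# 2. and 3. Rename the columns to the names used in the first batch.
old_names <- c("EN_country", "Score N", "Rank N")
new_names <- c("Country_EN", "Score", "Rank")
names(rsf_toy)[match(old_names, names(rsf_toy))] <- new_names

# Compute Score evolution (in the real data only after the values are cleaned).
rsf_toy$`Score evolution` <- rsf_toy$Score - rsf_toy$`Score N-1`

rsf_toy
```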
Important 1.2: The column sequence is irrelevant for row binding. Only the column names have to match.
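A small demonstration of this point, using base `rbind()` on two invented data frames; `dplyr::bind_rows()`, which the chapter uses, likewise matches columns by name.

```r
# Two toy data frames with the same columns in a different order.
a <- data.frame(iso = "NOR", rank = 1, stringsAsFactors = FALSE)
b <- data.frame(rank = 2, iso = "DNK", stringsAsFactors = FALSE)

# rbind() for data frames matches columns by name, not by position.
ab <- rbind(a, b)
ab
```

The result keeps the column order of the first data frame.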
R Code 1.9 : Clean column structure 2013-2021
Run this code chunk manually if the recoded file(s) still needs to be created
R Code 1.10 : Clean data values for the datasets 2013-2021
Run this code chunk manually if rwb2 still needs to be created and saved
Listing / Output 1.10: Follow the procedure of Procedure 1.2 and clean values for the RWB datasets 2013-2021
Code
base::source(file = "R/helper.R")

## load recoded rsf datasets into memory
my_get_dir_files("data/chap011/rsf_rec", "\\.rds$")

########### clean data second batch
df_list2 <- list(rsf2021_rec2, rsf2020_rec2, rsf2019_rec2, rsf2018_rec2,
                 rsf2017_rec2, rsf2016_rec2, rsf2015_rec2, rsf2014_rec2,
                 rsf2013_rec2)
lapply(df_list2, my_rwb_rec)

########### bind rows
rwb2 <- dplyr::bind_rows(rwb2021, rwb2020, rwb2019, rwb2018,
                         rwb2017, rwb2016, rwb2015, rwb2014, rwb2013) |>
  dplyr::mutate(dplyr::across(dplyr::where(is.character), as.factor)) |>
  dplyr::arrange(dplyr::desc(year_n), country_en)

############# save file
my_save_data_file("chap011/rwb", rwb2, "rwb2.rds")
(No output is available for this R code chunk)
1.2.3 Batch 3: 2002-2011/2012
1.2.3.1 Compare structure
The third batch is the easiest to clean up, because its Score N and Score N-1 values are not comparable with those of the other dataset batches. The scores for the years 2002-2012 (2011 missing) range from 0 (or -10 in 2011/2012) for the best situation to a maximum of 115.5 in 2009 for the worst. Therefore we also don’t need to create and compute the missing Score evolution column.
R Code 1.11 : Compare the structure of a selection of the datasets 2002-2012
Listing / Output 1.11: Compare the structure of a selection of the datasets 2002-2012 by using the janitor::compare_df_cols() function
#>                       column_name   rsf2012   rsf2008   rsf2005   rsf2002
#>  1                    AR_country character character character character
#>  2                    EN_country character character character character
#>  3                    ES_country character character character character
#>  4                    FA_country character character character character
#>  5                    FR_country character character character character
#>  6                           ISO character character character character
#>  7                Rank evolution   numeric   numeric   numeric   logical
#>  8                        Rank N   numeric   numeric   numeric   numeric
#>  9                      Rank N-1   numeric   numeric   numeric   logical
#> 10               Score exactions   logical   logical   logical   logical
#> 11                       Score N   numeric   numeric character character
#> 12    Score N with the exactions   logical   logical   logical   logical
#> 13 Score N without the exactions   logical   logical   logical   logical
#> 14                     Score N-1 character character character   logical
#> 15                      Year (N) character   numeric   numeric   numeric
#> 16                          Zone character character character character
1.2.3.2 Clean column structure
It turned out that the data frames from 2002-2012 have exactly the same structure as the datasets from 2013-2021. But there is one big difference: The values of the columns Score N and Score N-1 are not compatible with the rest of the data. So when I bind the rows of the different years together, I have to delete these columns to prevent misunderstandings. For the years 2002-2012 only the rank data can be used.
Procedure 1.3 : Clean structure for the RWB dataset 2002-2012
The following steps are necessary to clean up the values (and structure) of the 2002-2012 datasets:
1. Delete all columns that contain Score. These are:
   - Score N
   - Score N-1
   - Score N without the exactions
   - Score N with the exactions and
   - Score exactions
2. Rename EN_country to Country_EN, as in the second batch of datasets.
3. Rename Rank N to Rank.
4. Delete the country names in all languages except English.
5. Skip the missing year 2011
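Since every column to be deleted can be recognized by its name, the steps above can be sketched with name patterns in base R. The toy data frame `rsf_old` and its values are invented for illustration; this is not the chapter's own code.

```r
# Toy stand-in for one 2002-2012 file (values invented for illustration).
rsf_old <- data.frame(
  ISO = "FIN",
  EN_country = "Finland",
  FR_country = "Finlande",
  ES_country = "Finlandia",
  `Rank N` = 1,
  `Score N` = "0,5",
  `Score N-1` = NA,
  `Score exactions` = NA,
  `Year (N)` = 2005,
  Zone = "Europe",
  check.names = FALSE,
  stringsAsFactors = FALSE
)

col_names <- names(rsf_old)

# Step 1: every column whose name starts with "Score".
is_score <- grepl("^Score", col_names)

# Step 4: country name columns in languages other than English.
is_other_lang <- grepl("_country$", col_names) & col_names != "EN_country"

rsf_old <- rsf_old[, !(is_score | is_other_lang), drop = FALSE]

# Steps 2 and 3: rename to the names used in the first batch.
names(rsf_old)[names(rsf_old) == "EN_country"] <- "Country_EN"
names(rsf_old)[names(rsf_old) == "Rank N"] <- "Rank"

names(rsf_old)
```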
R Code 1.12 : Clean column structure 2002-2012
Run this code chunk manually if the recoded file(s) still needs to be created
After I tried to clean up the values in Listing / Output 1.13, I noticed another structural problem: As data for the year 2011 are missing, the dataset for 2012 has the character string 2011-12 as its year value and is therefore not compatible with the other datasets.
Procedure 1.4 : Clean values for the RWB datasets 2002-2012
Change the year_n values of the dataset 2012 from the character string 2011-12 to the numeric value of 2012
Change all columns of type character to columns of type factor.
Sort the data by year (year_n) and country name (country_en)
I also applied the last two changes to the other batches of datasets (batch 1 and 2) but didn’t mention this explicitly in the corresponding sections.
R Code 1.13 : Clean data values for the datasets 2002-2012
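The three value changes of Procedure 1.4 can be sketched in base R on two toy data frames. All values and object names are invented for illustration. The sort order (year descending, country ascending) is an assumption taken from the code for the second batch.

```r
# Toy stand-ins for the 2012 and 2010 files (values invented for illustration).
rsf2012_toy <- data.frame(
  year_n = c("2011-12", "2011-12"),
  country_en = c("Norway", "Finland"),
  rank = c(1, 1),
  stringsAsFactors = FALSE
)
rsf2010_toy <- data.frame(
  year_n = c(2010, 2010),
  country_en = c("Norway", "Finland"),
  rank = c(1, 1),
  stringsAsFactors = FALSE
)

# 1. Replace the character string "2011-12" with the numeric year 2012.
rsf2012_toy$year_n <- 2012

rwb3_toy <- rbind(rsf2012_toy, rsf2010_toy)

# 2. Convert all character columns to factors.
is_chr <- vapply(rwb3_toy, is.character, logical(1))
rwb3_toy[is_chr] <- lapply(rwb3_toy[is_chr], factor)

# 3. Sort by year (newest first) and country name.
rwb3_toy <- rwb3_toy[order(-rwb3_toy$year_n, rwb3_toy$country_en), ]

rwb3_toy
```

Without the first step the year column of the 2012 file would stay of type character and could not be combined cleanly with the numeric years of the other files.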
Run this code chunk manually if rwb3 still needs to be created and saved
Listing / Output 1.13: Follow the procedure of Procedure 1.4 and clean values for the RWB datasets 2002-2012
Now that I have cleaned the three different dataset batches, it is time to combine them into the one dataset I am going to work with.
R Code 1.14 : Bind the rows for the three cleaned batches of datasets rwb1.rds, rwb2.rds and rwb3.rds
Listing / Output 1.14: Bind the rows for the three cleaned batches of datasets rwb1.rds, rwb2.rds and rwb3.rds to the final rwb.rds file.
Code
base::source(file = "R/helper.R")

######### load cleaned batches of datasets into memory
my_get_dir_files("data/chap011/rwb", "\\.rds$")

########## bind rows
rwb <- (dplyr::bind_rows(rwb1, rwb2, rwb3))

########## save file
my_save_data_file("chap011/rwb", rwb, "rwb.rds")
1.3 Get M49
To harmonize the RWB datasets with the names of the planned country geometries for the maps, I need to download an official classification system. A detailed classification system expressly designed for statistical purposes has been developed by the United Nations Statistics Division (UNSD) using the M49 methodology.
M49 is officially called Standard country or area codes for statistical use (M49) and can be downloaded manually in different languages and formats (copy to the clipboard, Excel or CSV) from the Overview page. The “Overview” page offers no URL that an R script could use, because the buttons copy or download the data with the help of JavaScript. So I had to download the file manually or find another location where I could download it programmatically.
The M49 specification is included in the {ISOcodes} package. But I am using the official file because it has countries and regions together in a form where no major recoding is necessary.
CSV
Text files in which the values are separated by commas (Comma Separated Values = CSV). These files have the file extension .csv
M49
The United Nations publication "Standard Country or Area Codes for Statistical Use" was originally published as Series M, No. 49 and is now commonly referred to as the M49 standard. M49 is a country/areas classification system prepared by the Statistics Division of the United Nations Secretariat primarily for use in its publications and databases.
OMNIKA
OMNIKA DataStore is an open-access data science resource for researchers, authors, and technologists. OMNIKA Foundation is an American 501(c)(3) nonprofit organization that operates a digital mythological library. Almost every culture has relevant mythology that explains where we came from, why things are the way they are, and a number of other things. OMNIKA's goal is to collect, organize, index, and quantify all of those data in one place and make them available for free. (https://omnika.org/info/about)
RWB
Reporters Without Borders (RWB), known by its French name Reporters sans frontières and acronym RSF, is an international non-profit and non-governmental organization headquartered in Paris, France, founded in 1985 in Montpellier by journalists Robert Ménard, Rémy Loury, Jacques Molénat, and Émilien Jubineau. It is dedicated to safeguarding the right to freedom of information and defends journalists and media personnel who are imprisoned, persecuted, or at risk for their work. The organization has consultative status at the United Nations, UNESCO, the Council of Europe, and the International Organisation of the Francophonie.
UNSD
The United Nations Statistics Division (UNSD) is committed to the advancement of the global statistical system. It compiles and disseminates global statistical information, develops standards and norms for statistical activities, and supports countries' efforts to strengthen their national statistical systems.
UTF-8
UTF-8 is a character encoding system that uses between one and four eight-bit bytes to represent all valid Unicode code points. It is designed to be backward compatible with ASCII, meaning that the first 128 UTF-8 characters are identical to the ASCII characters numbered 0-127. UTF-8 has become the de facto standard character encoding for the internet and related document types, with 97.9% of websites using it by April 2023. (Brave AI)
# Prepare data {#sec-chap011}```{r}#| label: setup#| include: falsebase::source(file ="R/helper.R")```::::: {#obj-chap011}:::: {.my-objectives}::: {.my-objectives-header}Objectives:::::: {.my-objectives-container}The goal of this chapter is to prepare the dataset(s) I would need for my RSB Shiny dashboard. To prepare the data I need to follow several steps:- Get `r glossary("RWB")` datasets for all years from the [RWB website](https://rsf.org/en/index) (@sec-011-get-rwb-data).- Get the `r glossary("UNSD")` classification system using the `r glossary("M49")` methodology. (@sec-011-get-m49)::::::::::::## Get RWB data {#sec-011-get-rwb-data}The first task is to download all datasets from the [RWB website](https://rsf.org/en/index).There is no place where I can get an integrated harmonized dataset. The only way I found out is to go to every year’s index and to use the button "Download this index". Fortunately the link stem and the structure of the file name is the same for all years.Example: **<https://rsf.org/sites/default/files/import_classement/2025.csv>**. 
So I just have to change the year of the `r glossary("CSV")` file name.:::::{.my-r-code}:::{.my-r-code-header}:::::: {#cnj-011-get-rwb-data}: Get RWB datasets:::::::::::::{.my-r-code-container}<center>**Run this code chunk manually if the file(s) still needs to be downloaded.**</center>::: {#lst-011-get-rwb-data}```{r}#| label: get-rwb-data#| eval: falsebase::source(file ="R/helper.R")url <-"https://rsf.org/sites/default/files/import_classement/"rsf_year <-function(year) { rsf_name <-paste0(url, year, ".csv") readr::read_delim(rsf_name,delim =";", escape_double =FALSE, trim_ws =TRUE,show_col_types =FALSE)}my_year <-list("2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010",# 2011 is missing"2012", "2013", "2014", "2015", "2016", "2017", "2018", "2019", "2020","2021", "2022", "2023", "2024", "2025")for (i in1:length(my_year)) { my_name <-paste0("rsf", my_year[[i]])my_save_data_file("chap011/rsf",assign(my_name, rsf_year(my_year[[i]])),paste0(my_name, ".rds") )}```Download and store datasets from RWB website::: <center>(*For this R code chunk is no output available*)</center>:::::::::The side effect of @lst-011-get-rwb-data is a collection of data files from 2002 to 2025 (2011 missing) with the structure of `rsf<year>.rds`.## Clean RWB dataA manual inspections of the RSF website and of the downloaded dataset reveals three different structures of RWB datasets. My aim is to create one dataset for the whole time range 2012-2025. 
Therefore I have to compare the structure of the different dataset.The first step is to load all datasets into the R memory::::::{.my-r-code}:::{.my-r-code-header}:::::: {#cnj-011-load-all-datasets}: Load all datasets into R memory:::::::::::::{.my-r-code-container}<center>**Run this code chunk manually if the file(s) still needs to be loaded into memory**</center>```{r}#| label: load-all-datasets#| lst-label: lst-011-load-all-datasets#| lst-cap: "Load all saved RWB datasets (2002-2025, 2011 missing) into R memory"base::source(file ="R/helper.R")my_get_dir_files("data/chap011/rsf", "\\.rds$")```***<center>(*For this R code chunk is no output available. The function has the side effect that ll files with the specified path and file extensions are loaded into the R memory.*)</center>:::::::::### Batch 1: 2022-2025#### Compare structureData frames of the years 2022-2025 contain the most complete datasets. From 2022 onwards there is a new [Methodology used for compiling the World Press Freedom Indices](https://rsf.org/en/methodology-used-compiling-world-press-freedom-index-2025) worked out by panel of experts.They have slightly different column numbers and column names::::::{.my-r-code}:::{.my-r-code-header}:::::: {#cnj-011-compare-structure-2022-2025}: Compare structure of the datasets 2022-2025:::::::::::::{.my-r-code-container}```{r}#| label: compare-structure-2022-2025#| lst-label: lst-011-compare-structure-2022-2025#| lst-cap: "Compare the structure of the datasets 2022-2025 by using the `janitor::compare_df_cols()` function"janitor::compare_df_cols(rsf2025, rsf2024, rsf2023, rsf2022)```:::::::::The table in @lst-011-compare-structure-2022-2025 shows the differences with `<NA>` values:- **Line 06**: In 2022 are the country names in Portuguese missing. But this is not relevant for my use case as I will just use `Country_EN`. 
- **Line 20 and 21**: The column `Score 2025` in the 2025 dataset has to be renamed to `Score` to get a general name for all datasets and to match the other datasets.- **Line 22 and 23**: `Score evolution` and the score of the previous year (`Score N-1`) is missing in the data of 2022. The reason is that in 2021 the score calculation followed a different methodology.- **Line 24**: There is only in 2024 a column judging the absolute score values with five predicates (in French): `r unique(rsf2024$Situation)`. The values for the classifications has changed with 2022 and are documented (in English) in the already mentioned article on [Methodology 2022-2025](https://rsf.org/en/methodology-used-compiling-world-press-freedom-index-2025).::: {.callout-important #imp-011-methodological-considerations}###### Methodological considerationsEven if the years 2022-2025 and 2013-2021 have the same scale (0-100 points) these scales measure different factors of press freedom:- **2022-2025**: global score, political context, economic context, legal context, social context and safety.- **2013-2021**: global score, pluralism, media independence, environment and self-censorship, legislative framework, transparency, infrastructure, abuses. (Only the global score is reported.)There is no problem to compare ranks over these two different measure methods. But a comparison of the values is critical. To understand why this is the case, imagine the comparisons of cars: In one period we measure their maximum speed, in the other period we measure the economy of their fuel consumption. Even if we judge both measures with a scale of 0-100 they are not comparable.But in our case, we have not measured different outcome / products but we have used different indicators to measure the same thing, namely press freedom. I believe therefore that it is feasible to compare the global score of the different measurements. 
To use our car example: In one period we measure the economy of the fuel consumption on the highway with a speed of 100 km/h, in the other period we compare it with a combined measure of driving in the city, on country roads and highways. Both methods gives us a measure about the economy of the fuel consumption, even if the components of the global score differ.As the RWB methods in both periods used the same scale (0 for the worst, 100 for the best), I think it is legitimate to compare the differences between 2022 (the first year with the new method) with 2021 (the last year with the previous method). This reflection concerns the two missing columns in 2022: `Score evolution` and `Score N-1`. To bind the rows together we need to add these two columns with the appropriate values to 2022.In graphs comparing global score values for years with different methods it would be helpful to signal this distinction with different colors, line types etc. in the same figure.:::#### Adapt 2022 dataset:::::{.my-r-code}:::{.my-r-code-header}:::::: {#cnj-011-add-2022-columns}: Add `Score N-1` and `Score evolution` to RWB dataset 2022 :::::::::::::{.my-r-code-container}<center>**Run this code chunk manually if the 2022 still needs to be adapted**</center>```{r}#| label: add-2022-columns#| lst-label: lst-011-add-2022-columns#| lst-cap: "Add `Score N-1` from 2021 and compute `Score evolution` to RWB dataset 2022"rsf2021_short <-readRDS(paste0(here::here(), "/data/chap011/rsf/rsf2021.rds")) |> dplyr::select(c(ISO, `Score N`)) |> dplyr::rename(`Score N-1`=`Score N`)rsf2022 <-readRDS(paste0(here::here(), "/data/chap011/rsf/rsf2022.rds"))## check if columns for rsf2022 are already presentif (!"Score evolution"%in%names(rsf2022)) { rsf2022 <- rsf2022 |> dplyr::left_join(rsf2021_short, "ISO") |> dplyr::mutate(`Score evolution`= Score -`Score N-1`)my_save_data_file("chap011/rsf", rsf2022, "rsf2022.rds")}rsf2022```***::::::::::::::{.my-watch}:::{.my-watch-header}Inconsistent score 
values:::::::{.my-watch-container}A check of the computed values for `Score evalution` shows inconsistent values. Compare for instance in the 2022 dataset the value of the `Score` column for Ireland (ISO = "IRL") with the rest of the first ten rows. This problem exists with all score values, as one can see in the `Poltical_context` for Finland (FIN). A further detailed examination revealed that - [The values are not within the values 0-100 because they are lacking decimal position.]{.mark} For instance the `Score` value of Norway (NOR) for 2022 is `9265` instead of `92.65`. - [Trailing zeros are not displayed.]{.mark} Instead of `883` for Ireland the value is `8830`, or `80.30` for the scale of 0-100. There are also some values with only two figures, representing values with *two* trailing zeros.- Reviewing thoroughly other years it turned out that some years [distinguishes ties with the addition of another figure added as a (silent) comma position.]{.mark} Instead of `65.487` for the USA and `64.486` for Gambia in 2025 we have `65487` and `65486`.:::::::::#### Compare column typesBefore we can bind the datasets 2022-2025 together we have to solve another issue. A glimpse at the data shows that for the years 2023-2025 `Score evolution` is of type `character` with comma values (e.g., `"0,63"`) instead of type `double` with a decimal point (e.g., `0.63`) as used in R for decimals. To bind rows for the data of different years together we have not only to match different column names but also the types of columns.:::::{.my-r-code}:::{.my-r-code-header}:::::: {#cnj-011-column-types-2025}: Inspect the columns type for the RWB dataset 2025:::::::::::::{.my-r-code-container}```{r}#| label: column-types-2025#| lst-label: lst-011-column-types-2025#| lst-cap: "Column types of original RWB dataset 2025"rsf2025 <-readRDS(paste0(here::here(), "/data/chap011/rsf/rsf2025.rds"))dplyr::glimpse(rsf2025)```:::::::::@lst-011-column-types-2025 shows the problem with `Score evolution`. 
This inspection also shows that we need to convert all `character` columns to columns of type `factor`. But we can do this later, when we bind the rows together.

Another issue we should clean up is that some country names, even in the English version, are not `r glossary("UTF-8")` encoded. This problem only concerns the 2025 dataset:

- "C�te d'Ivoire" instead of "Côte d'Ivoire" in `Country_EN`,
- "T�rkiye" instead of "Türkiye" in `Country_EN` and
- "Am�riques" instead of "Amériques" in `Zone`.

On this occasion we also learn that the regional names in `Zone` are not in English but in French for *all* years.

#### Clean column structure

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-011-clean-columns-2022-2025}
: Clean column structure 2022-2025
::::::
:::
::::{.my-r-code-container}
<center>**Run this code chunk manually if the recoded file(s) still needs to be created**</center>

```{r}
#| label: clean-columns-2022-2025
#| lst-label: lst-011-clean-columns-2022-2025
#| lst-cap: "Clean and reorganize column structure for row binding"
#| eval: false

base::source(file = "R/helper.R")

save_path = "chap011/rsf_rec"
save_ext = "_rec.rds"
load_path = paste0(here::here(), "/data/chap011/rsf/")

rsf_batch1 <- function(df, year, path, ext) {
  # 2025 only: generic score name and repair of mis-encoded names
  if ("Score 2025" %in% names(df)) {
    df <- df |>
      dplyr::rename(Score = `Score 2025`) |>
      dplyr::mutate(
        Country_EN = dplyr::case_when(
          stringr::str_detect(Country_EN, "d'Ivoire") ~ "Côte d'Ivoire",
          stringr::str_detect(Country_EN, "rkiye") ~ "Türkiye",
          .default = Country_EN
        ),
        Zone = dplyr::if_else(
          stringr::str_detect(Zone, "riques"), "Amériques", Zone
        )
      )
  }
  # 2024 only: drop the French situation predicate
  if ("Situation" %in% names(df)) {
    df <- dplyr::select(df, -Situation)
  }
  # 2023-2025: decimal comma to decimal point, character to double
  if (year %in% 2023:2025) {
    df <- df |>
      dplyr::mutate(
        `Score evolution` =
          as.double(stringr::str_replace(`Score evolution`, ",", "."))
      )
  }
  df <- df |>
    janitor::clean_names() |>
    dplyr::relocate(country_en, .after = iso) |>
    dplyr::select(-c(country_fr:country_fa)) |>
    dplyr::relocate(year_n, .before = iso)
  my_save_data_file(path, df, paste0("rsf", year, ext))
}

get_rsf_recoded1 <- function(years, path) {
  for (i in seq_along(years)) {
    my_name <- paste0(path, "rsf", years[i], ".rds")
    rsf_batch1(readRDS(my_name), years[i], save_path, save_ext)
  }
}

get_rsf_recoded1(2022:2025, load_path)
```

***

<center>(*No output is available for this R code chunk*)</center>
::::
:::::

#### Clean values

We are now in a position to clean the values and bind the rows of the datasets 2022-2025 together. On this occasion we will create a function, because we need most of the code for the other two batches as well.

:::::{.my-procedure}
:::{.my-procedure-header}
:::::: {#prp-011-clean-values-2022-2025}
: Clean values for the datasets 2022-2025
::::::
:::
::::{.my-procedure-container}
Let's summarize what we want to clean up:

1. **Bind the rows of the datasets together**
2. **Update *all* score figures**: This includes the (global) `score` but also the `political_context`, the `economic_context`, the `legal_context`, the `social_context`, the `safety` and the `score_n_1` columns. The update has to be done in a sequence of five steps. The correct sequence is important!
    - Multiply all scores smaller than 100 by 100
    - Multiply all scores smaller than 1000 by 10
    - Divide all scores bigger than 10000 by 10
    - Divide the new numbers by 100 to get the correct decimal scores
    - Update `score_evolution` by subtracting `score_n_1` from `score`. (This last step is only necessary for the 2022 dataset, but as a precaution I will do it for all datasets.)
3. **Create for every score column a new factor column with five bins:** Take the bin names and limits from the classification of the press freedom map as outlined in the [methodology article for 2022 onwards](https://rsf.org/en/methodology-used-compiling-world-press-freedom-index-2025). Name each new column by appending `_situation` to the name of its score column. Move the newly created columns from the end of the data frame to the position immediately after the score column they assess.
4. **Change all columns with type `character` into columns of type `factor`**
::::
:::::

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-011-clean-values-2022-2025}
: Clean data values for the datasets 2022-2025
::::::
:::
::::{.my-r-code-container}
<center>**Run this code chunk manually if `rwb1` still needs to be created and saved**</center>

```{r}
#| label: clean-values-2022-2025
#| lst-label: lst-011-clean-values-2022-2025
#| lst-cap: "Follow the procedure of @prp-011-clean-values-2022-2025 and clean values for the RWB datasets 2022-2025"
#| eval: false

base::source(file = "R/helper.R")

## load recoded rsf datasets into memory
my_get_dir_files("data/chap011/rsf_rec", "\\.rds$")

########## clean data first batch
df_list1 = list(rsf2025_rec, rsf2024_rec, rsf2023_rec, rsf2022_rec)
lapply(df_list1, my_rwb_rec)

########## bind rows
rwb1 <- dplyr::bind_rows(rwb2022, rwb2023, rwb2024, rwb2025) |>
  dplyr::mutate(dplyr::across(dplyr::where(is.character), as.factor)) |>
  dplyr::arrange(dplyr::desc(year_n), country_en)

########## save file
my_save_data_file("chap011/rwb", rwb1, "rwb1.rds")
```

***

<center>(*No output is available for this R code chunk*)</center>
::::
:::::

### Batch 2: 2013-2021

#### Compare structure

We are now going to inspect the second batch of datasets: the data from the years 2013-2021.

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-011-compare-structure-2013-2021}
: Compare structure of selected datasets between 2013-2021
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: compare-structure-2013-2021
#| lst-label: lst-011-compare-structure-2013-2021
#| lst-cap: "Compare the structure of selected datasets 2013-2021 by using the `janitor::compare_df_cols()` function"
#| results: hold

janitor::compare_df_cols(rsf2021, rsf2018, rsf2016, rsf2013)
```
::::
:::::

The data frames from the years 2013 to 2021 are quite different.
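
The score inconsistencies also affect this batch, so the five-step score update from @prp-011-clean-values-2022-2025 is worth illustrating with a minimal base R sketch. The function `fix_score()` and the toy data frame are hypothetical. They are not the `my_rwb_rec()` helper from `R/helper.R`, whose implementation is not shown here. Apart from the three raw `score` values quoted in the text, all numbers are invented.

```r
# Hypothetical helper: repairs raw scores that lost their decimal point
# and trailing zeros. The order of the steps matters.
fix_score <- function(x) {
  x <- ifelse(x < 100, x * 100, x)    # step 1: two trailing zeros were dropped
  x <- ifelse(x < 1000, x * 10, x)    # step 2: one trailing zero was dropped
  x <- ifelse(x > 10000, x / 10, x)   # step 3: extra tie-breaking digit
  x / 100                             # step 4: restore the decimal point
}

# Toy data: `score` uses raw values quoted in the text, `score_n_1` is invented
toy <- data.frame(
  score     = c(9265, 883, 65487),
  score_n_1 = c(9150, 87, 6601)
)

toy$score     <- fix_score(toy$score)
toy$score_n_1 <- fix_score(toy$score_n_1)

# step 5: recompute the evolution from the repaired scores
toy$score_evolution <- toy$score - toy$score_n_1
toy
```

Running the steps in a different order would change the result: dividing by 100 first, for example, would push every value below 100 and make the multiplications in steps 1 and 2 apply to all rows.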
Datasets from 2013 to 2021 have only 16 columns, because they contain only the global score. All the `context` variables and the `safety` column are missing. Although the questionnaire has used completely different indicators from 2022 onwards, one could compare the countries over the years by their *global* scores. See the methodological considerations in @imp-011-methodological-considerations. There is nothing to clean up here: the `context` variables are only available in the first batch of datasets.

#### Clean column structure

:::::{.my-procedure}
:::{.my-procedure-header}
:::::: {#prp-011-clean-structure-2013-2022}
: Clean column structure for the RWB datasets 2013-2021
::::::
:::
::::{.my-procedure-container}
To clean up the column structure of the datasets 2013-2021, three actions are necessary:

1. Three columns, `Score exactions`, `Score N with the exactions` and `Score N without the exactions`, are not present in the first batch of datasets (2022-2025). There is also a column type mismatch of numeric versus logical in the `exactions` columns, because in the first years of the second batch `Score N with the exactions` and `Score N without the exactions` contain only missing (`NA`) values. In any case, I couldn't find an explanation of what the `exactions` columns measure. So I will delete these columns.
2. In the first batch the two-letter language code stands at the end of the country name columns (`Country_EN`). In the second batch (2013-2021) it appears at the start of the column name (`EN_country`). This is only important for the English names, as I will use only the English variant of the country names. Other issues (such as the score value inconsistency) are the same in both dataset batches.
3. The column names for the global score and rank are `Score N` and `Rank N` instead of just `Score` and `Rank` as in the first batch of datasets. I have to rename them.

Additionally there is another issue: The `Score evolution` column is missing.
This column is important and easy to compute because `Score N` and `Score N-1` are present. But this change has to be done after cleaning the `Score N` and `Score N-1` values.
::::
:::::

::: {.callout-important #imp-011-column-sequencefor-row-binding}
###### The column sequence for row binding is irrelevant

Only the column names have to match.
:::

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-011-clean-columns-2013-2021}
: Clean column structure 2013-2021
::::::
:::
::::{.my-r-code-container}
<center>**Run this code chunk manually if the recoded file(s) still needs to be created**</center>

```{r}
#| label: clean-columns-2013-2021
#| lst-label: lst-011-clean-columns-2013-2021
#| lst-cap: "Clean / reorganize column structure for row binding"
#| eval: false

base::source(file = "R/helper.R")

save_path = "chap011/rsf_rec"
save_ext = "_rec2.rds"
load_path = paste0(here::here(), "/data/chap011/rsf/")

## load rsf datasets into memory
my_get_dir_files("data/chap011/rsf", "\\.rds$")

rsf_batch2 <- function(df, year, path, ext) {
  df <- df |>
    dplyr::select(-contains("exactions")) |>    # (1)
    dplyr::rename(
      Country_EN = EN_country,                  # (2)
      Score = `Score N`,                        # (3)
      Rank = `Rank N`                           # (3)
    ) |>
    dplyr::relocate(Country_EN, .after = ISO) |>
    janitor::clean_names() |>
    dplyr::select(-c(fr_country:fa_country))    # (2)
  my_save_data_file(path, df, paste0("rsf", year, ext))
}

get_rsf_recoded2 <- function(years, path) {
  for (i in seq_along(years)) {
    my_name <- paste0(path, "rsf", years[i], ".rds")
    rsf_batch2(readRDS(my_name), years[i], save_path, save_ext)
  }
}

get_rsf_recoded2(2013:2021, load_path)
```

***

<center>(*No output is available for this R code chunk*)</center>
::::
:::::

#### Clean values

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-011-clean-values-2013-2021}
: Clean data values for the datasets 2013-2021
::::::
:::
::::{.my-r-code-container}
<center>**Run this code chunk manually if `rwb2` still needs to be created and saved**</center>

```{r}
#| label: clean-values-2013-2021
#| lst-label: lst-011-clean-values-2013-2021
#| lst-cap: "Follow the procedure of @prp-011-clean-structure-2013-2022 and clean values for the RWB datasets 2013-2021"
#| eval: false

base::source(file = "R/helper.R")

## load recoded rsf datasets into memory
my_get_dir_files("data/chap011/rsf_rec", "\\.rds$")

########## clean data second batch
df_list2 = list(rsf2021_rec2, rsf2020_rec2, rsf2019_rec2,
                rsf2018_rec2, rsf2017_rec2, rsf2016_rec2,
                rsf2015_rec2, rsf2014_rec2, rsf2013_rec2)
lapply(df_list2, my_rwb_rec)

########## bind rows
rwb2 <- dplyr::bind_rows(rwb2021, rwb2020, rwb2019, rwb2018,
                         rwb2017, rwb2016, rwb2015, rwb2014, rwb2013) |>
  dplyr::mutate(dplyr::across(dplyr::where(is.character), as.factor)) |>
  dplyr::arrange(dplyr::desc(year_n), country_en)

########## save file
my_save_data_file("chap011/rwb", rwb2, "rwb2.rds")
```

***

<center>(*No output is available for this R code chunk*)</center>
::::
:::::

### Batch 3: 2002-2011/2012

#### Compare structure

The third batch is the easiest to clean up, because the `Score N` and `Score N-1` values are not comparable with those of the other dataset batches. The scores used for the years 2002-2012 (2011 missing) range from 0 (or -10 in 2011/2012) for the best situation to a maximum of 115.5 in 2009 for the worst situation.
Therefore we also don't need to create and compute the missing `Score evolution` column.

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-011-compare-structure-2002-2012}
: Compare the structure of a selection of the datasets 2002-2012
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: compare-structure-2002-2012
#| lst-label: lst-011-compare-structure-2002-2012
#| lst-cap: "Compare the structure of a selection of the datasets 2002-2012 by using the `janitor::compare_df_cols()` function"
#| results: hold

janitor::compare_df_cols(rsf2012, rsf2008, rsf2005, rsf2002)
```
::::
:::::

#### Clean column structure

It turned out that the data frames from 2002-2012 have exactly the same structure as the datasets from 2013-2021. But there is one big difference: The values of the `Score N` and `Score N-1` columns are not compatible with the rest of the data. So before I bind the rows of the different years together, I have to delete these columns to prevent misunderstandings. For the years 2002-2012 only the rank data can be used.

:::::{.my-procedure}
:::{.my-procedure-header}
:::::: {#prp-011-clean-values-2002-2012}
: Clean structure for the RWB dataset 2002-2012
::::::
:::
::::{.my-procedure-container}
The following steps are necessary to clean up the structure (and values) of the 2002-2012 datasets:

1. Delete all columns that contain `Score`. These are:
    - `Score N`
    - `Score N-1`
    - `Score N without the exactions`
    - `Score N with the exactions` and
    - `Score exactions`
2. Rename `EN_country` to `Country_EN`, as in the second batch of datasets.
3. Rename `Rank N` to `Rank`.
4. Delete the country names in all languages except English.
5. Skip the missing year 2011.
::::
:::::

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-011-clean-columns-2002-2012}
: Clean column structure 2002-2012
::::::
:::
::::{.my-r-code-container}
<center>**Run this code chunk manually if the recoded file(s) still needs to be created**</center>

```{r}
#| label: clean-columns-2002-2012
#| lst-label: lst-011-clean-columns-2002-2012
#| lst-cap: "Clean / reorganize column structure for row binding"
#| eval: false

base::source(file = "R/helper.R")

save_path = "chap011/rsf_rec"
save_ext = "_rec3.rds"
load_path = paste0(here::here(), "/data/chap011/rsf/")

## load rsf datasets into memory
my_get_dir_files("data/chap011/rsf", "\\.rds$")

rsf_batch3 <- function(df, year, path, ext) {
  df <- df |>
    dplyr::select(-contains("Score")) |>        # (1)
    dplyr::rename(
      Country_EN = EN_country,                  # (2)
      Rank = `Rank N`                           # (3)
    ) |>
    dplyr::relocate(Country_EN, .after = ISO) |>
    janitor::clean_names() |>
    dplyr::select(-c(fr_country:fa_country))    # (4)
  my_save_data_file(path, df, paste0("rsf", year, ext))
}

get_rsf_recoded3 <- function(years, path) {
  for (i in seq_along(years)) {
    if (years[i] == 2011) {next}                # (5)
    my_name <- paste0(path, "rsf", years[i], ".rds")
    rsf_batch3(readRDS(my_name), years[i], save_path, save_ext)
  }
}

get_rsf_recoded3(2002:2012, load_path)
```

***

<center>(*No output is available for this R code chunk*)</center>
::::
:::::

#### Clean values

When I tried to clean up the values in @lst-011-clean-values-2002-2012 I noticed another structural problem: Because the data for the year 2011 are missing, the 2012 dataset has the character string `2011-12` as its `year_n` value and is therefore not compatible with the other datasets.

:::::{.my-procedure}
:::{.my-procedure-header}
:::::: {#prp-011-clean-values-2002-2012}
: Clean values for the RWB datasets 2002-2012
::::::
:::
::::{.my-procedure-container}
1. Change the `year_n` values of the 2012 dataset from the character string `2011-12` to the numeric value `2012`.
2. Change all columns of type `character` to columns of type `factor`.
3. Sort the data by year (`year_n`) and country name (`country_en`).

I applied the last two changes to the other batches of datasets (batch 1 and 2) as well, but didn't mention them explicitly in the corresponding sections.
::::
:::::

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-011-clean-values-2002-2013}
: Clean data values for the datasets 2002-2012
::::::
:::
::::{.my-r-code-container}
<center>**Run this code chunk manually if `rwb3` still needs to be created and saved**</center>

```{r}
#| label: clean-values-2002-2012
#| lst-label: lst-011-clean-values-2002-2012
#| lst-cap: "Follow the procedure of @prp-011-clean-values-2002-2012 and clean values for the RWB datasets 2002-2012"
#| eval: false

base::source(file = "R/helper.R")

## load recoded rsf datasets into memory
my_get_dir_files("data/chap011/rsf_rec", "\\.rds$")

rwb_rec3 <- function(df) {
  # only the 2012 dataset has a character `year_n` ("2011-12")
  if (is.character(df$year_n)) {                # (1)
    df <- dplyr::mutate(df, year_n = 2012)      # (1)
  }
  # store the data frame as rwb<year> in the global environment
  file_name = paste0("rwb", df$year_n[1])
  assign(file_name, df, envir = globalenv())
}

########## clean data third batch
df_list3 = list(rsf2012_rec3, rsf2010_rec3, rsf2009_rec3,
                rsf2008_rec3, rsf2007_rec3, rsf2006_rec3,
                rsf2005_rec3, rsf2004_rec3, rsf2003_rec3,
                rsf2002_rec3)
lapply(df_list3, rwb_rec3)

########## bind rows
rwb3 <- dplyr::bind_rows(rwb2012, rwb2010, rwb2009, rwb2008, rwb2007,
                         rwb2006, rwb2005, rwb2004, rwb2003, rwb2002) |>
  dplyr::mutate(dplyr::across(dplyr::where(is.character), as.factor)) |>  # (2)
  dplyr::arrange(dplyr::desc(year_n), country_en)                         # (3)

########## save file
my_save_data_file("chap011/rwb", rwb3, "rwb3.rds")
```

***

<center>(*No output is available for this R code chunk*)</center>
::::
:::::

### All together

Now that I have cleaned the three different dataset batches, it is time to combine them into the one dataset I am going to work with.

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-011-all-batches-together}
: Bind the rows for the three cleaned batches of datasets `rwb1.rds`, `rwb2.rds` and `rwb3.rds`
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: all-batches-together
#| lst-label: lst-011-all-batches-together
#| lst-cap: "Bind the rows for the three cleaned batches of datasets `rwb1.rds`, `rwb2.rds` and `rwb3.rds` to the final `rwb.rds` file."
#| eval: false

base::source(file = "R/helper.R")

########## load cleaned batches of datasets into memory
my_get_dir_files("data/chap011/rwb", "\\.rds$")

########## bind rows
rwb <- dplyr::bind_rows(rwb1, rwb2, rwb3)

########## save file
my_save_data_file("chap011/rwb", rwb, "rwb.rds")
```
::::
:::::

## Get M49 {#sec-011-get-m49}

To harmonize the RWB datasets with the names of the planned country geometries for the maps I need to download an official classification system.
A detailed classification system designed expressly for statistical purposes has been developed by the United Nations Statistics Division (`r glossary("UNSD")`) using the `r glossary("M49")` methodology.

M49 is officially called [Standard country or area codes for statistical use (M49)](https://unstats.un.org/unsd/methodology/m49/) and can be downloaded manually in different languages and formats (copy to the clipboard, Excel or `r glossary("CSV")`) from the [Overview page](https://unstats.un.org/unsd/methodology/m49/overview/). The "Overview" page offers no URL that an R script could use, because the buttons copy or download the data with the help of JavaScript. So I had to download the file manually or find another location where I could download it programmatically.

The M49 specification is included in the {**ISOcodes**} package. But I am using the official file because it contains countries and regions together in a form that needs no major recoding.

With the `r glossary("OMNIKA")` DataStore [United Nations M49 Region Codes](https://omnika.org/datastore/datasets/un-m49-region-codes) I found an [external source for the UNSD-M49 country classification](https://github.com/omnika-datastore/unsd-m49-standard-area-codes). For security reasons I compared the manually downloaded file and the OMNIKA file with `base::all.equal()` to determine whether they are identical. Yes, they are!

The UNSD M49 standard area codes are stored as Excel and CSV files.
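
The identity check with `base::all.equal()` mentioned above can be sketched as follows. The two small data frames are toy stand-ins for the manually downloaded file and the OMNIKA copy. Their object and column names are assumptions for illustration only.

```r
# Toy stand-ins for the two M49 files (object and column names are assumed)
m49_manual <- data.frame(
  m49_code = c("004", "008"),
  country  = c("Afghanistan", "Albania"),
  stringsAsFactors = FALSE
)
m49_omnika <- m49_manual

# all.equal() returns TRUE for equal objects ...
files_identical <- isTRUE(all.equal(m49_manual, m49_omnika))

# ... and a description of the differences otherwise,
# so wrap it in isTRUE() when a logical value is needed
m49_changed <- m49_omnika
m49_changed$country[2] <- "Albania "   # trailing blank added on purpose
files_differ <- !isTRUE(all.equal(m49_manual, m49_changed))

files_identical
files_differ
```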
For reproducibility reasons I download the CSV file.

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-011-get-m49}
: Get M49 classification system
::::::
:::
::::{.my-r-code-container}
<center>**Run this code chunk manually if the file(s) still needs to be downloaded.**</center>

::: {#lst-011-get-m49}
```{r}
#| label: get-m49
#| eval: false

## download unsd-m49 file ############
url_m49 <- "https://github.com/omnika-datastore/unsd-m49-standard-area-codes/raw/refs/heads/main/2022-09-24__CSV_UNSD_M49.csv"
downloader::download(
  url = url_m49,
  destfile = "data/m49.csv"
)

## create R object ###############
m49_raw <- readr::read_delim(
  file = "data/m49.csv",
  delim = ";",
  escape_double = FALSE,
  trim_ws = TRUE,
  show_col_types = FALSE
)

my_save_data_file("chap011", m49_raw, "m49_raw.rds")
```

Download and store the M49 classification system
:::

<center>(*No output is available for this R code chunk*)</center>
::::
:::::

## Glossary Entries {.unnumbered}

```{r}
#| label: glossary-table
#| echo: false

glossary_table()
```

------------------------------------------------------------------------

## Session Info {.unnumbered}

::: my-r-code
::: my-r-code-header
Session Info
:::

::: my-r-code-container
```{r}
#| label: session-info

xfun::session_info()
```
:::
:::