perccalc
is very flexible and can be used for
any ordered categorical variable which has a theoretical order, such as
education or occupation. In this case study we will calculate the
percentile difference in mathematics test scores based on the education
of the parents for several countries using the PISA 2006 and PISA 2012
datasets. Let’s load our packages:
percalc
automatically loads pisa_2012
and
pisa_2006
which are two datasets with all the information
that we need. These two datasets have data for Estonia, Germany and
Spain and contain the five test scores in Mathematics and the father’s
education in the international ISCED classification. The first thing we
have to do is make sure the categories are ordered and calculate the
average test score for each student.
order_edu <- c("None",
"ISCED 1",
"ISCED 2",
"ISCED 3A, ISCED 4",
"ISCED 3B, C",
"ISCED 5A, 6",
"ISCED 5B")
# Make ordered categories of our categorical variables and calculate avgerage
# math test scores for each year
pisa_2012 <-
pisa_2012 %>%
mutate(father_edu = factor(father_edu, levels = order_edu, ordered = TRUE))
pisa_2006 <-
pisa_2006 %>%
mutate(father_edu = factor(father_edu, levels = order_edu, ordered = TRUE))
# Merge them together
pisa <- rbind(pisa_2006, pisa_2012)
Once the categories are ordered, perc_diff
can calculate
the percentile difference between the 90th and 10th percentile for the
complete sample. For example:
perc_diff(data_model = pisa,
categorical_var = father_edu,
continuous_var = avg_math,
percentiles = c(90, 10))
#> difference se
#> 45.07312 14.18308
This means that the difference in Mathematics test scores between the
90th and 10th percentile of father education is 45 points with a
standard error of 14 points. We can extrapolate this example for each
country separately using dplyr::group_by
. Since the result
of perc_diff
is a vector, to be able to work seamlessly
with dplyr::group_by
we will use perc_diff_df
,
which returns the same results but as a data frame:
With df_wrap
we can just pass that to
tidyr::nest
and dplyr::mutate
and calculate it
separetely by year and country:
cnt_diff <-
pisa %>%
nest(data = c(-year, -CNT)) %>%
mutate(edu_diff = lapply(data, function(x) perc_diff_df(x, father_edu, avg_math, percentiles = c(90, 10)))) %>%
select(-data) %>%
unnest(edu_diff)
cnt_diff
#> # A tibble: 6 × 4
#> year CNT difference se
#> <dbl> <chr> <dbl> <dbl>
#> 1 2006 Estonia 18.3 5.55
#> 2 2006 Germany 61.2 28.8
#> 3 2006 Spain 52.6 15.1
#> 4 2012 Germany 45.0 27.4
#> 5 2012 Spain 40.2 18.2
#> 6 2012 Estonia 17.0 9.88
We can see some big differences between, for example, Estonia and Spain. But even more interesting is plotting this:
cnt_diff %>%
ggplot(aes(year, difference, group = CNT, color = CNT)) +
geom_point() +
geom_line() +
theme_minimal() +
scale_y_continuous(name = "Achievement gap in Math between the 90th and \n 10th percentile of father's education") +
scale_x_continuous(name = "Year")
It looks like Estonia has a much smaller achievement gap relative to Spain and Germany but also note that both Germany and Spain have been decreasing their inequality. We can also try different achievement gaps to to explore the distribution:
# Calculate the gap for the 90/50 gap
cnt_half <-
pisa %>%
nest(data = c(-year, -CNT)) %>%
mutate(edu_diff = lapply(data,
function(x) perc_diff_df(x, father_edu, avg_math, percentiles = c(90, 50)))) %>%
select(-data) %>%
unnest(edu_diff)
# Calculate the gap for the 50/10 gap
cnt_bottom <-
pisa %>%
nest(data = c(-year, -CNT)) %>%
mutate(edu_diff = lapply(data,
function(x) perc_diff_df(x, father_edu, avg_math, percentiles = c(50, 10)))) %>%
select(-data) %>%
unnest(edu_diff)
cnt_diff$type <- "90/10"
cnt_half$type <- "90/50"
cnt_bottom$type <- "50/10"
final_cnt <- rbind(cnt_diff, cnt_half, cnt_bottom)
final_cnt$type <- factor(final_cnt$type, levels = c("90/10", "90/50", "50/10"))
final_cnt
#> # A tibble: 18 × 5
#> year CNT difference se type
#> <dbl> <chr> <dbl> <dbl> <fct>
#> 1 2006 Estonia 18.3 5.55 90/10
#> 2 2006 Germany 61.2 28.8 90/10
#> 3 2006 Spain 52.6 15.1 90/10
#> 4 2012 Germany 45.0 27.4 90/10
#> 5 2012 Spain 40.2 18.2 90/10
#> 6 2012 Estonia 17.0 9.88 90/10
#> 7 2006 Estonia -12.8 5.51 90/50
#> 8 2006 Germany 1.98 27.6 90/50
#> 9 2006 Spain 10.2 14.2 90/50
#> 10 2012 Germany -17.5 23.5 90/50
#> 11 2012 Spain -3.38 17.1 90/50
#> 12 2012 Estonia -17.0 9.34 90/50
#> 13 2006 Estonia 31.2 5.55 50/10
#> 14 2006 Germany 59.2 26.6 50/10
#> 15 2006 Spain 42.4 13.9 50/10
#> 16 2012 Germany 62.4 26.1 50/10
#> 17 2012 Spain 43.6 17.1 50/10
#> 18 2012 Estonia 34.0 9.91 50/10
Having this dataframe we can visualize all the three gaps in a very intuitive fashion:
final_cnt %>%
ggplot(aes(year, difference, group = CNT, color = CNT)) +
geom_point() +
geom_line() +
theme_minimal() +
scale_y_continuous(name = "Achievement gap in Math between the 90th and \n 10th percentile of father's education") +
scale_x_continuous(name = "Year") +
facet_wrap(~ type)
It seems that the 90/50
and 50/10
differences are not symmetrical.
percalc
also has a perc_dist
function which
calculates the distribution of the percentiles, so we can compare more
finegrained percentiles rather than differences:
perc_dist(pisa, father_edu, avg_math)
#> # A tibble: 100 × 3
#> percentile estimate std.error
#> <int> <dbl> <dbl>
#> 1 1 0.853 1.64
#> 2 2 1.74 3.20
#> 3 3 2.65 4.69
#> 4 4 3.60 6.10
#> 5 5 4.57 7.45
#> 6 6 5.57 8.72
#> 7 7 6.59 9.93
#> 8 8 7.64 11.1
#> 9 9 8.71 12.2
#> 10 10 9.80 13.2
#> # ℹ 90 more rows
Here we get the complete percentile distribution with the test score in mathematics for each percentile. This can be easily scaled to all country/year combinations with our previous code:
cnt_dist <-
pisa %>%
nest(data = c(-year, -CNT)) %>%
mutate(edu_diff = lapply(data, function(x) perc_dist(x, father_edu, avg_math))) %>%
select(-data) %>%
unnest(edu_diff)
cnt_dist
#> # A tibble: 600 × 5
#> year CNT percentile estimate std.error
#> <dbl> <chr> <int> <dbl> <dbl>
#> 1 2006 Estonia 1 0.647 0.388
#> 2 2006 Estonia 2 1.31 0.762
#> 3 2006 Estonia 3 2.00 1.12
#> 4 2006 Estonia 4 2.71 1.47
#> 5 2006 Estonia 5 3.43 1.81
#> 6 2006 Estonia 6 4.17 2.13
#> 7 2006 Estonia 7 4.93 2.44
#> 8 2006 Estonia 8 5.70 2.74
#> 9 2006 Estonia 9 6.48 3.02
#> 10 2006 Estonia 10 7.28 3.30
#> # ℹ 590 more rows
Let’s limit the distribution only to the 10th, 20th, 30th… 100th percentile an compare for country/years:
cnt_dist %>%
mutate(year = as.character(year)) %>%
filter(percentile %in% seq(0, 100, by = 10)) %>%
ggplot(aes(percentile, estimate, color = year, group = percentile)) +
geom_point() +
geom_line(color = "black") +
scale_y_continuous(name = "Math test score") +
scale_x_continuous(name = "Percentiles from father's education") +
scale_color_discrete(name = "Year") +
facet_wrap(~ CNT) +
theme_minimal()
Here the red dots indicate the year 2006 and the bluish dots year 2012, the black line between them indicates the change over time. Here we can see that although Germany and Spain are decreasing (as we saw in the plot before), the composition of the change is very different: Spain’s decrease is big all around the distribution whereas Germany’s concentrate on the top percentiles.
This type of analysis serves well to disentangle and decompose distributions of ordered categorical. This vignette looked to show the power of this techniqe and how it can be used in practically any setting with an ordered categorical variable and a continuous variable.
If you use this in a publication, remember to cite the software as:
#> To cite perccalc in publications use:
#>
#> Cimentada, (2019). perccalc: An R package for estimating percentiles
#> from categorical variables. Journal of Open Source Software, 4(44),
#> 1796, https://doi.org/10.21105/joss.01796
#>
#> A BibTeX entry for LaTeX users is
#>
#> @Article{,
#> doi = {10.21105/joss.01796},
#> url = {https://doi.org/10.21105/joss.01796},
#> year = {2019},
#> publisher = {The Open Journal},
#> volume = {4},
#> number = {44},
#> pages = {1796},
#> author = {Jorge Cimentada},
#> title = {perccalc: An R package for estimating percentiles from categorical variables},
#> journal = {Journal of Open Source Software},
#> }