Take-Home Exercise 1 Take-2
In this take-home exercise, we are required to:
I have selected Che Xuan’s submission for this exercise.
Following are the charts in the submission:
For simplicity, the code from the original submission is not shown on this webpage. You may refer to Che Xuan’s webpage for more information.
The following packages will be used in this exercise: tidyverse (which includes dplyr and ggplot2), ggrepel, ggdist, ggridges, patchwork and ggthemes.
The code chunk below installs any missing packages and loads them
into the R session. We then import Participants.csv
from the data folder using read_csv() and save it
as a tibble data frame called demographic_data.
packages = c('tidyverse', 'ggrepel', 'ggdist', 'ggridges', 'patchwork', 'ggthemes')
for (p in packages) {
  # install the package only if it cannot be loaded
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
demographic_data <- read_csv("data/Participants.csv")
glimpse(demographic_data)
Rows: 1,011
Columns: 7
$ participantId <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,~
$ householdSize <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ~
$ haveKids <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU~
$ age <dbl> 36, 25, 35, 21, 43, 32, 26, 27, 20, 35, 48, 2~
$ educationLevel <chr> "HighSchoolOrCollege", "HighSchoolOrCollege",~
$ interestGroup <chr> "H", "B", "A", "I", "H", "D", "I", "A", "G", ~
$ joviality <dbl> 0.001626703, 0.328086500, 0.393469590, 0.1380~
In my own submission for Take-Home Exercise 1, I described the dataset briefly and carried out data exploration and wrangling on it. You may wish to visit this page for more information.
For the purpose of this exercise, I will retain the author’s approach and will only make changes to his/her charts from a data visualisation point of view.
First, we rename the columns and values in
demographic_data using the functions rename()
and sub()
for a better format and ease of reading.
# rename columns
demographic_data <- demographic_data %>%
  rename('Participant_ID' = 'participantId',
         'Household_Size' = 'householdSize',
         'Have_Kids' = 'haveKids',
         'Age' = 'age',
         'Education_Level' = 'educationLevel',
         'Interest_Group' = 'interestGroup',
         'Joviality' = 'joviality')
# rename row values
demographic_data$Education_Level <- sub('HighSchoolOrCollege',
                                        'High School or College',
                                        demographic_data$Education_Level)
# Have_Kids is logical; sub() converts it to character before replacing
demographic_data$Have_Kids <- sub('TRUE',
                                  'Yes',
                                  demographic_data$Have_Kids)
demographic_data$Have_Kids <- sub('FALSE',
                                  'No',
                                  demographic_data$Have_Kids)
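The recoding step can also be tried on a small example first. The sketch below uses base R only and a made-up data frame called toy (an assumption for illustration, not the exercise data). It shows one possible alternative for the logical column, ifelse(), next to the sub() call used above.

```r
# Toy data frame standing in for demographic_data (values are made up)
toy <- data.frame(
  Have_Kids = c(TRUE, FALSE, TRUE),
  Education_Level = c("HighSchoolOrCollege", "Low", "Graduate"),
  stringsAsFactors = FALSE
)

# ifelse() maps the logical column directly, without pattern matching
toy$Have_Kids <- ifelse(toy$Have_Kids, "Yes", "No")

# sub() replaces the first match of the pattern in each string
toy$Education_Level <- sub("HighSchoolOrCollege",
                           "High School or College",
                           toy$Education_Level)

print(toy)
```

Either route ends with character values; ifelse() simply avoids relying on the implicit conversion of TRUE and FALSE to text.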
Next, we look at each chart one by one.
Clarity:
(+) It is helpful to include a plot title and rename the x- and y-axis labels for a better understanding of the chart. The appended statistics also help users see the difference in counts among the household sizes.
(-) However, given that there are only 3 household sizes (1, 2, 3), there may not be a need to rearrange the bars in descending order. In this case, retaining the natural x-axis order lets users see quickly and intuitively which household sizes are the most and least common among the participants. A summary of the chart findings could also be helpful for the user.
Aesthetics:
(+) Setting the y-axis limit keeps the data labels clearly visible.
(-) The orientation of the y-axis label could be changed for ease of reading.
Changes:
Other than addressing the (-) above, we also:
use theme_minimal() and the theme() function to lighten the background
so that the dark grey bars stand out, centre the plot title and
increase its font size
use the axis.* and panel.* arguments of theme() to change the y-axis label orientation, panel colour and axis colour
rely on the renamed education level values to improve the readability of the x-axis (for the education level chart)
include an annotation of our observations
customise the figure size to make it bigger
p1b <- ggplot(data = demographic_data,
              aes(x = Household_Size)) +
  geom_bar() +
  ylim(0, 500) +
  # label each bar with its count and its share of all participants;
  # after_stat() replaces the deprecated ..count.. notation
  geom_text(stat = "count",
            aes(label = after_stat(paste0(count, " (",
                                          round(count / sum(count) * 100, 1),
                                          "%)"))),
            vjust = -1) +
  ylab("No. of\nParticipants") +
  ggtitle("Participants' Household Size") +
  theme_minimal() +
  theme(plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5)) +
  theme(axis.title.y = element_text(angle = 0), axis.ticks.x = element_blank(),
        panel.background = element_blank(), axis.line = element_line(color = 'grey')) +
  annotate("text",
           x = 2.5,
           y = 480,
           label = "Small Family Sizes of 1 to 3,\nwith almost equal proportions across the 3 groups", size = 5, color = 'mediumvioletred')
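To make the bar labels less of a black box, the sketch below reproduces the "count (percent%)" computation outside ggplot2 with base R. The household_size vector is made up for illustration and is not the exercise data.

```r
# Toy household sizes standing in for demographic_data$Household_Size
household_size <- c(1, 1, 2, 2, 2, 3, 3, 3, 3, 3)

# Count participants per household size, as stat = "count" does
counts <- as.vector(table(household_size))

# Build one "count (percent%)" label per bar
bar_labels <- paste0(counts, " (",
                     round(counts / sum(counts) * 100, 1), "%)")
names(bar_labels) <- sort(unique(household_size))

print(bar_labels)
```

The percentage is each count divided by the total over all bars, which is what sum(count) refers to inside the label aesthetic.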
The same observations are made for the bar chart on education level and hence similar changes are applied.
# order the education levels from lowest to highest for the x-axis
demographic_data$Education_Level = factor(demographic_data$Education_Level,
                                          levels = c('Low', 'High School or College', 'Bachelors', 'Graduate'))
p2b <- ggplot(data = demographic_data,
              aes(x = Education_Level)) +
  geom_bar() +
  ylim(0, 650) +
  # label each bar with its count and its share of all participants;
  # after_stat() replaces the deprecated ..count.. notation
  geom_text(stat = "count",
            aes(label = after_stat(paste0(count, " (",
                                          round(count / sum(count) * 100, 1),
                                          "%)"))),
            vjust = -1) +
  ylab("No. of\nParticipants") +
  ggtitle("Participants' Education Level") +
  theme_minimal() +
  theme(plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5)) +
  theme(axis.title.y = element_text(angle = 0), axis.ticks.x = element_blank(),
        panel.background = element_blank(), axis.line = element_line(color = 'grey')) +
  annotate("text",
           x = 3.2,
           y = 625,
           label = ">50% are high school or college graduates,\n~40% hold at least a bachelor's degree.", size = 5, color = 'mediumvioletred')
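The bar order on the x-axis follows the factor levels. As a small illustration in base R, with a made-up education vector (an assumption, not the exercise data), the sketch below contrasts the default alphabetical levels with explicitly supplied ones.

```r
# Toy education values (made up for illustration)
education <- c("Graduate", "Low", "Bachelors", "High School or College", "Low")

# Without explicit levels, factor() sorts the levels alphabetically
default_levels <- levels(factor(education))
print(default_levels)

# With explicit levels, the order follows the vector we supply
education_ordered <- factor(education,
                            levels = c("Low", "High School or College",
                                       "Bachelors", "Graduate"))
print(levels(education_ordered))

# Counts are reported in the level order, which is the order of the bars
print(table(education_ordered))
```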
Clarity:
(+) The plot title helps users understand the purpose of the chart, and the mean joviality marker allows users to compare the interest groups.
(-) However, without any stated observations, we cannot ascertain the author’s intention for the chart. It is unclear whether joviality is meant to be compared across interest groups within the participants with kids (or without kids), within each interest group between participants with and without kids, or both. There is also no legend explaining what “FALSE” and “TRUE” mean in the chart.
Aesthetics:
(+) The plots are stacked (with and without kids) for easy comparison.
(-) Since a box plot already shows the median, the mean marker may not be needed. The canvas could also be expanded so that differences in the joviality level are easier to see.
Changes:
Other than addressing the (-) above, we also:
adopted a dodged bar chart, geom_bar(), of mean joviality
with a proper legend instead of a trellis boxplot to achieve the same
purpose (assuming the author’s intent is to compare the average
joviality of participants within each interest group between those with
kids and those without kids) but with greater clarity;
used geom_hline() and geom_text() to
insert a dashed reference line at 0.4 (it appears vertical after coord_flip()) for ease of comparison
used the axis.* arguments of theme() to change the y-axis label orientation and axis colour
included an annotation, via annotate(), of our
observations
customised the figure size to make it bigger
p3a <- ggplot(data = demographic_data,
              aes(y = Joviality, x = Interest_Group)) +
  geom_boxplot() +
  # mark the mean of each group; fun replaces the deprecated fun.y argument
  stat_summary(geom = "point",
               fun = "mean",
               colour = "red",
               size = 3) +
  facet_grid(Have_Kids ~ .) +
  ggtitle("Joviality across Interest Groups by Kids Status")
# position of the dashed reference line on the joviality axis
h_line <- 0.4
p3b <- ggplot(data = demographic_data) +
  # one bar per interest group and kids status, with height = mean joviality
  geom_bar(aes(Interest_Group, Joviality, fill = Have_Kids),
           color = "black", position = "dodge", stat = "summary", fun = "mean") +
  ylim(0, 0.70) +
  geom_hline(aes(yintercept = h_line), linetype = "dashed", color = "blue") +
  geom_text(aes(0, h_line, label = h_line, vjust = -1)) +
  theme(axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 1)) +
  # flip the axes so the bars run horizontally and the reference line is vertical
  coord_flip() +
  theme(axis.title.y = element_text(angle = 0),
        axis.line = element_line(color = 'grey')) +
  ggtitle(label = 'Avg Joviality of Participants in Each Interest Group\n Between Those With and Without Kids') +
  theme(plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5)) +
  annotate("text",
           x = 9,
           y = 0.60,
           label = "Only Participants with kids\n in J, I, G, F, B are happier \n than those without kids.", size = 4, color = 'mediumvioletred')
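The bar heights in this chart are group means. A minimal base R sketch of the same summary is given below, assuming a made-up data frame toy with two interest groups (the numbers are illustrative only and say nothing about the real data).

```r
# Toy data standing in for demographic_data (values are made up)
toy <- data.frame(
  Interest_Group = c("A", "A", "A", "A", "B", "B", "B", "B"),
  Have_Kids = c("Yes", "Yes", "No", "No", "Yes", "Yes", "No", "No"),
  Joviality = c(0.2, 0.4, 0.5, 0.7, 0.6, 0.8, 0.1, 0.3)
)

# Mean joviality per interest group and kids status,
# the same quantity that stat = "summary", fun = "mean" draws as bars
group_means <- aggregate(Joviality ~ Interest_Group + Have_Kids,
                         data = toy, FUN = mean)

# Flag the groups whose mean clears the 0.4 reference line
group_means$above_line <- group_means$Joviality > 0.4

print(group_means)
```

Reading the table row by row gives the same comparison as reading the dodged bars against the dashed line.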
Clarity:
(+) Providing a definition of the data (Joviality) and a plot title helps users understand the purpose of the chart.
(-) However, without any stated observations, we cannot fully appreciate the author’s purpose for the chart. We can only infer it from the purpose of a raincloud plot, which is to visualise the raw data, the distribution of the data and key summary statistics at the same time.
Aesthetics:
(+) The plots are stacked for easy comparison.
(-) The education level could be rearranged from lowest to highest level for easier readability and comparison.
Changes:
Other than addressing the (-) above, the following changes were also made:
p4b <- ggplot(demographic_data, aes(x = Education_Level, y = Joviality, fill = Education_Level)) +
  # the "cloud": half-eye density without the point and interval
  ggdist::stat_halfeye(
    adjust = .5,
    width = .6,
    .width = 0,
    justification = -.3,
    point_colour = NA) +
  # the box plot: median and quartiles, outliers hidden
  geom_boxplot(
    width = .25,
    outlier.shape = NA
  ) +
  # the "rain": jittered raw data points
  geom_point(
    size = 1.3,
    alpha = .3,
    position = position_jitter(
      seed = 1, width = .1
    )
  ) +
  coord_cartesian(xlim = c(1.2, NA), clip = "off") +
  ggtitle(label = "Joviality Distribution for Different Education Levels",
          subtitle = "Joviality is quite well spread within each education level,\n with more highly educated participants being happier.") +
  theme_minimal() +
  theme(plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(size = 12, hjust = 0.5, color = 'mediumvioletred')) +
  theme(axis.title.y = element_text(angle = 0), axis.ticks.x = element_blank(),
        panel.background = element_blank(), axis.line = element_line(color = 'grey'))
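The box plot layer of the raincloud plot summarises each group by its median, hinges and whiskers. The sketch below shows these five numbers for a made-up joviality vector using base R's boxplot.stats(). This is an illustration only: ggplot2's geom_boxplot() computes its hinges as quartiles, which can differ slightly from boxplot.stats() on some data.

```r
# Toy joviality values (made up for illustration)
joviality <- c(0.05, 0.20, 0.35, 0.50, 0.65, 0.80, 0.95)

# stats holds: lower whisker, lower hinge, median, upper hinge, upper whisker
box <- boxplot.stats(joviality)
print(box$stats)

# out holds the points beyond the whiskers (none in this toy vector)
print(box$out)
```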
Clarity:
(+) It is helpful to include plot title for users to better understand the purpose of the chart.
(-) Ridgeline plots are usually meant for visualizing changes in distributions over time or space, which is not the case for this dataset. Without providing any further explanation or summary of the chart we are not able to truly appreciate the author’s purpose of the chart.
Aesthetics:
(+) The plots are clean.
(-) The plots do not highlight any key information to communicate the author’s intent.
Changes:
We suppose that the author would like to see the distribution of joviality across interest groups. As such, we continue to use a raincloud-style half-eye plot instead of a ridgeline plot. Other than addressing the (-) above, the following changes were also made:
p5b <- ggplot(demographic_data, aes(x = Interest_Group, y = Joviality, fill = Interest_Group)) +
  # half-eye density with median point and interval for each interest group
  stat_halfeye(adjust = .35,
               width = .6,
               color = 'black',
               justification = -.15,
               position = position_nudge(x = .12)) +
  # dashed reference line at the midpoint of the joviality scale
  geom_hline(aes(yintercept = 0.5),
             linetype = 'dashed',
             color = 'blue',
             size = .6) +
  coord_flip() +
  ggtitle(label = "Joviality Distribution for Different Interest Groups",
          subtitle = "Joviality is quite well spread within each interest group,\n interest group E has the highest median joviality index while interest group H has the lowest.") +
  theme_minimal() +
  theme(plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(size = 12, hjust = 0.5, color = 'mediumvioletred')) +
  theme(axis.title.y = element_text(angle = 0), axis.ticks.x = element_blank(),
        panel.background = element_blank(), axis.line = element_line(color = 'grey'))
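A statement about which group has the highest or lowest median can be checked numerically. The sketch below is one way to do so in base R, assuming a made-up data frame toy with placeholder groups X, Y and Z (hypothetical names and values, not results from the exercise data).

```r
# Toy data with placeholder groups (values are made up)
toy <- data.frame(
  Interest_Group = c("X", "X", "X", "Y", "Y", "Y", "Z", "Z", "Z"),
  Joviality = c(0.6, 0.7, 0.9, 0.1, 0.3, 0.4, 0.4, 0.5, 0.8)
)

# Median joviality per group
group_medians <- tapply(toy$Joviality, toy$Interest_Group, median)
print(sort(group_medians, decreasing = TRUE))

# Groups with the highest and lowest median
highest <- names(which.max(group_medians))
lowest <- names(which.min(group_medians))

# reorder() sorts the factor levels by median, lowest first,
# which would also sort the groups along the plot axis
sorted_groups <- reorder(toy$Interest_Group, toy$Joviality, FUN = median)
print(levels(sorted_groups))
```

Sorting the groups by median before plotting is optional, but it lets the reader find the extremes without scanning every density.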
Clarity:
(+) It is helpful to provide an overview with an overall title of all the plots discussed thus far. The plots are also well labelled for clarity.
(-) An overall summary would be most helpful to conclude the discussion.
Aesthetics:
(+) The plots are well sized and colour-coordinated. The panel background is also changed to enhance the visualisation.
(-) The plots did not highlight any key information to call on users’ attention.
Changes:
Other than the changes already made to the individual charts, we also include an overall summary as the plot title.
# "/" stacks plots vertically and "|" places them side by side
patchwork <- ((p1b / p2b) | p3b) / (p4b | p5b) +
  plot_annotation(tag_levels = 'I',
                  title = paste0('The participants are from small households (up to 3 members) and most of them are high school or college graduates.\n\n',
                                 'All of them take part in an interest group and those without kids have higher mean joviality.\n\n',
                                 'Among the participants, those with a higher education level or who take part in interest group E also seem to have higher median joviality.\n'),
                  caption = 'Demographics of the City of Engagement, Ohio, USA')
# "&" applies the theme to every plot in the composition
patchwork & theme_economist()
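For readers new to patchwork, the layout operators can be tried on a smaller composition first. The sketch below assumes ggplot2 and patchwork are installed and uses three placeholder plots (p_a, p_b and p_c, hypothetical names) built from R's built-in mtcars data instead of the exercise charts.

```r
library(ggplot2)
library(patchwork)

# Three placeholder plots built from the built-in mtcars data
p_a <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
p_b <- ggplot(mtcars, aes(factor(cyl))) + geom_bar()
p_c <- ggplot(mtcars, aes(hp, mpg)) + geom_point()

# "/" stacks plots vertically, "|" places them side by side
combined <- (p_a / p_b) | p_c

# plot_annotation() adds an overall title, a caption and panel tags
combined <- combined +
  plot_annotation(title = "Overall title (placeholder)",
                  caption = "Caption (placeholder)",
                  tag_levels = "I")

# "&" applies a theme to every plot in the composition
combined & theme_minimal()
```

Using + theme_minimal() instead of & would only restyle the last plot added, which is why the composition above is themed with &.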
With this, we conclude this exercise, in which we have made various changes to the 5 sets of charts in an attempt to make them clearer and aesthetically easier to interpret.