Deploying skills from in Lesson 1 and Hands-on Exercise 1
In this take-home exercise, we will use appropriate statistical graphics methods to reveal the demographic of the city of Engagement, Ohio USA from the VAST Challenge 2022.
Demographic is the statistical characteristics of a given population such as age, gender, ethnicity, income, level of education, religion, occupation, family structure etc.
In this exercise, the data will be processed using appropriate tidyverse family of packages and the statistical graphics using ggplot2 and its extensions.
The packages tidyverse (including dplyr, magrittr, ggplot2, patchwork), skimr, ggrepel will be used for the purpose of this exercise:
The code chunk below is used to install and load the required packages onto RStudio.
packages = c('tidyverse','skimr','ggrepel')
for(p in packages){
if(!require(p, character.only =T)){
install.packages(p)
}
library(p, character.only =T)
}
About 1000 representative residents have volunteered to provide data using the city’s urban planning app, which records the places they visit, their spending, and their purchases, among other things; totaling to 3 dataset folders (Activity Logs, Attributes, Journals). We will only be using the dataset Participants from the Attributes folder for this exercise.
The code chuck below import Participants.csv from the data
folder into R by using read_csv() and save it as an tibble
data frame called demographic_data.
demographic_data <- read_csv("data/Participants.csv")
First, let’s get a general sense of our data using the function
glimpse() from the library dplyr.
Rows: 1,011
Columns: 7
$ participantId <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,~
$ householdSize <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ~
$ haveKids <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU~
$ age <dbl> 36, 25, 35, 21, 43, 32, 26, 27, 20, 35, 48, 2~
$ educationLevel <chr> "HighSchoolOrCollege", "HighSchoolOrCollege",~
$ interestGroup <chr> "H", "B", "A", "I", "H", "D", "I", "A", "G", ~
$ joviality <dbl> 0.001626703, 0.328086500, 0.393469590, 0.1380~
We see that there are a total of 1011 participants (rows) with 7 attributes (columns).
Some attributes that we would think of as categorical data (eg.
educationLevel) are currently of class character as indicated
by the <chr> just to the right of the column names in
the glimpse() output. This means that the variable values
are character strings, such as words or phrases.
Some attributes are of class <dbl>, which stands
for double precision which indicates that they are numeric and that they
have decimal values. We would have prefer some of the attributes eg.
householdSize to be integer values which would not allow for
decimal numbers.
Lastly, there is a attribute of class <lgl>, which
stands for logical which indicates a boolean value of true or false.
Instead of a true/false data, we would prefer to have a data value of
yes/no to whether the participants have kids for an easier reading of
the charts in our subsequent exercise. Therefore, a new column
(AnyKids) is added using the function mutate() and
case_when() with the code below:
There is another common data class, factor, which is abbreviated as
<fct>. A factor is something that has unique levels
but there is no appreciable order to the levels. For example, here
participantId has numeric value when it is just an id for each
participant; we want it to be interpreted as just a unique level and not
as the number.
Hence, we will convert participantId into factors by using
the across() function of the dplyr package and
the as.factor() base function. The across()
function has two main arguments: (i) the columns we want to operate on
and (ii) the function or list of functions to apply to each column.
library(magrittr)
demographic_data <-demographic_data %>%
mutate(across(c(participantId), as.factor))
#let's confirm the change
glimpse(demographic_data)
Rows: 1,011
Columns: 8
$ participantId <fct> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,~
$ householdSize <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ~
$ haveKids <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU~
$ age <dbl> 36, 25, 35, 21, 43, 32, 26, 27, 20, 35, 48, 2~
$ educationLevel <chr> "HighSchoolOrCollege", "HighSchoolOrCollege",~
$ interestGroup <chr> "H", "B", "A", "I", "H", "D", "I", "A", "G", ~
$ joviality <dbl> 0.001626703, 0.328086500, 0.393469590, 0.1380~
$ AnyKids <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Ye~
We can see that participantId is now factor as indicated by
The skim() function of the skimr
package is also helpful for getting a general sense of our data. By
design, it provides summary statistics about the attributes in the
dataset.
| Name | demographic_data |
| Number of rows | 1011 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| factor | 1 |
| logical | 1 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| educationLevel | 0 | 1 | 3 | 19 | 0 | 4 | 0 |
| interestGroup | 0 | 1 | 1 | 1 | 0 | 10 | 0 |
| AnyKids | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| participantId | 0 | 1 | FALSE | 1011 | 0: 1, 1: 1, 2: 1, 3: 1 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| haveKids | 0 | 1 | 0.3 | FAL: 710, TRU: 301 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| householdSize | 0 | 1 | 1.96 | 0.79 | 1 | 1.00 | 2.00 | 3.00 | 3 | ▇▁▇▁▆ |
| age | 0 | 1 | 39.07 | 12.38 | 18 | 29.00 | 39.00 | 50.00 | 60 | ▇▇▇▇▇ |
| joviality | 0 | 1 | 0.49 | 0.29 | 0 | 0.24 | 0.48 | 0.75 | 1 | ▇▇▇▇▇ |
The function provides separate tables of summary statistics for each
data type: factor, numeric and logical. There is a column called
n_missing for the number of values that are missing. We
confirmed that there are no missing data in our dataset.
These participants are all adults ranging from a minimum age of 18 to a maximum of 60 years old, with a mean age of 39 years old which is considerably young. Since the participants need to use and navigate the city’s urban planning app for data collection, perhaps that is why we see only participants from this age range who are supposedly more tech savvy and accustomed to app usage.
In this section, we use various charts to understand the demographic of these participants from the city of Engagement, Ohio USA.
A chart for the age distribution of these participants are prepared
with the following changes to the ggplot():
ggplot(data=demographic_data, aes(x = age)) +
geom_histogram(bins=10,color="black",fill="skyblue3")+
geom_vline(aes(xintercept=mean(age,na.rm=T)),color="red",linetype="dashed",size=1)+
xlab("Age") +
ylab("No. of Participants")+
xlim(18,60)+
ggtitle(label = 'Population Across Age Group',
subtitle = '1011 participants are grouped in 10 bins according to their age.')+
theme(plot.title = element_text(size=14, face="bold",hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
We see that participants are quite well distributed across all the age groups with lesser representative in the late 30s range.
Next, we want to see the household size distribution of these participants.
ggplot(data=demographic_data, aes(x = householdSize)) +
geom_bar(color="black",fill="skyblue3") +
geom_text(stat="count",aes(label=paste0(..count..,", ",round(..count../sum(..count..)*100,1),"%")),vjust=-1) +
ylim(0,400) +
xlab("Household Size") +
ylab("No. of Participants")+
ggtitle('Household Size Distribution')+
theme(plot.title = element_text(size=14, face="bold",hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
The household sizes are relatively small ranging from 1 to 3 with almost equal portion across the 3 groups.
Let’s take a look at the age distribution within each househould sizes.
ggplot(data=demographic_data, aes(x = age)) +
geom_histogram(color="black",fill="skyblue3", bins=10) +
xlim(18,60)+
xlab("Age") +
ylab("No. of Participants")+
ggtitle('Age Distribution within Each Household Size')+
theme(plot.title = element_text(size=14, face="bold",hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))+
facet_grid(~ householdSize)
The singles (i.e. househould size = 1) are mostly in their late 40s, and we see mostly young families (late 20s to early 30s) among the participants.
We will use factor() to specify the factor level order
from lowest to the highest education level for a more intuitive
interpretation of the chart.
demographic_data$educationLevel = factor(demographic_data$educationLevel, levels = c('Low', 'HighSchoolOrCollege', 'Bachelors','Graduate'))
ggplot(data=demographic_data,aes(educationLevel)) +
geom_bar(color="black",fill="skyblue3") +
geom_text(stat="count",aes(label=paste0(..count..,", ",round(..count../sum(..count..)*100,1),"%")),vjust=-1) +
ylim (0, 600) +
xlab("Education Level") +
ylab("No. of Participants") +
ggtitle(label = 'Education Level of Participants',
subtitle = 'The participants are grouped according to their education level.')+
theme(plot.title = element_text(size=14, face="bold",hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
We see that more than 50% of the participants is a high school or college graduate, with about 40% of them holding a min. bachelors degree.
It is commonly viewed that higher educated females tends to have lesser kids. While we do not have the gender of these participants, the assumption is that married couple are usually of similar educational background and hence viewed as an unit i.e. higher educated parents tends to have lesser kids.
With this in mind, we layered the AnyKids attributes onto each education level category.
ggplot(data=demographic_data,aes(x=educationLevel, fill=AnyKids)) +
geom_bar(color="black", position = "dodge") +
ylim (0, 400) +
xlab("Education Level") +
ylab("No. of Participants") +
ggtitle(label = 'Proportion of Participants with Kids for Each Education Category',
subtitle = 'The participants are grouped according to their education level and if they have any kids.')+
theme(plot.title = element_text(size=14, face="bold",hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
When we stacked the bars side by side using
position = "dodge" within each category, it is apparent
that parents of lower educational background do have more
kids (i.e. almost half of them have kids) compared to those of
higher educational background where only about one-third of them have
kids.
Now, let us see what are the relationship of the joviality of the participants with the various attributes. We prepared four different charts to compare joviality with whether they have kids, their education level, age and interest group. We compare the charts together using the ~ library.
library(patchwork)
p1 <- ggplot(data=demographic_data,aes(x=AnyKids, y=joviality)) +
geom_boxplot(color="black",fill="skyblue3")+
xlab("Any Kids") +
ylab("Joviality") +
ggtitle('Kids & Joviality')+
theme(plot.title = element_text(size=12, face="bold",hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
p2 <- ggplot(data=demographic_data) +
geom_boxplot(aes(educationLevel, joviality),
color="black",fill="skyblue3") +
xlab("Education Level") +
ylab("Joviality") +
theme(plot.title = element_text(size=12, face="bold",hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))+
ggtitle('Education Level & Joviality')
p3 <- ggplot(data=demographic_data) +
geom_boxplot(aes(interestGroup, joviality),
color="black",fill="skyblue3")+
xlab("Interest Group") +
ylab("Joviality") +
theme(plot.title = element_text(size=12, face="bold",hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))+
ggtitle('Interest Group & Joviality')
p4 <- ggplot(data=demographic_data,
aes(x=age, y=joviality)) +
geom_point(stat = "summary", fun = "median") +
xlab("Age") +
ylab("Joviality") +
theme(plot.title = element_text(size=12, face="bold",hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))+
ggtitle('Age & Joviality')
(p1 /p4) | (p2 / p3)
In general, it seems that there are not much distinct difference in the degree of joviality across all attributes i.e. everyone is as cheerful as one another. A closer look, perhaps participants who have kids or hold a graduate degree or take part in interest group “E” are typical more jovial than their peers.
We proceed to create a trellis plot with educational level, kids, interest group to see which combination may gives a more jovial individual. For an easier comparison, an arbitrary line of 0.8 (joviality) is inserted across all panels.
ggplot(data=demographic_data) +
geom_bar(aes(educationLevel, joviality,fill=AnyKids),
color="black", position = "dodge", stat = "summary", fun = "median")+
geom_hline(yintercept=0.8, linetype="dashed", color = "red")+
xlab("Education Level") +
ylab("Joviality") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
coord_flip()+
ggtitle(label = 'Education Level, Kids, Interest Group Affects Joviality?',
subtitle = 'The participants are grouped according to education level, if they have kids and which interest group they joined.')+
theme(plot.title = element_text(size=14, face="bold",hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))+
facet_wrap(AnyKids~interestGroup)
We see that the most jovial participants may be the graduates with kids in interest group H, and those with bachelors degree with kids in interest Group I.
We have used some simple charts in attempt to illustrate the demographic of the city of Engagement, Ohio USA using the 7 attributes of the 1011 participants.
We see that the participants ranged from young adults to pre-retirement age, mostly with high school/college qualification. Their family size are typically small of not more than 3. The singles are mostly in their late 40s, and there are more young families (late 20s to late 30s) within them.
All of them also participated in some sort of an interest group. The more jovial individuals among them may be the graduates with kids in interest group H, and those with bachelors degree with kids in interest Group I.