1. The Task

In this take-home exercise, we will use appropriate statistical graphics methods to reveal the demographic of the city of Engagement, Ohio USA from the VAST Challenge 2022.

Demographic is the statistical characteristics of a given population such as age, gender, ethnicity, income, level of education, religion, occupation, family structure etc.

In this exercise, the data will be processed using appropriate tidyverse family of packages and the statistical graphics using ggplot2 and its extensions.

2. Exploration of Data

2.1 Importing the relevant packages

The packages tidyverse (including dplyr, magrittr, ggplot2, patchwork), skimr, ggrepel will be used for the purpose of this exercise:

The code chunk below is used to install and load the required packages onto RStudio.

packages = c('tidyverse','skimr','ggrepel')

for(p in packages){
  if(!require(p, character.only =T)){
    install.packages(p)
    }
  library(p, character.only =T)
}

2.2 Importing Data

About 1000 representative residents have volunteered to provide data using the city’s urban planning app, which records the places they visit, their spending, and their purchases, among other things; totaling to 3 dataset folders (Activity Logs, Attributes, Journals). We will only be using the dataset Participants from the Attributes folder for this exercise.

The code chuck below import Participants.csv from the data folder into R by using read_csv() and save it as an tibble data frame called demographic_data.

demographic_data <- read_csv("data/Participants.csv")

3. Data Exploration and Wrangling

First, let’s get a general sense of our data using the function glimpse() from the library dplyr.

3.1 Quick Overview

library(dplyr)
demographic_data %>%
  glimpse()

Rows: 1,011
Columns: 7
$ participantId  <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,~
$ householdSize  <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ~
$ haveKids       <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU~
$ age            <dbl> 36, 25, 35, 21, 43, 32, 26, 27, 20, 35, 48, 2~
$ educationLevel <chr> "HighSchoolOrCollege", "HighSchoolOrCollege",~
$ interestGroup  <chr> "H", "B", "A", "I", "H", "D", "I", "A", "G", ~
$ joviality      <dbl> 0.001626703, 0.328086500, 0.393469590, 0.1380~

We see that there are a total of 1011 participants (rows) with 7 attributes (columns).

Some attributes that we would think of as categorical data (eg. educationLevel) are currently of class character as indicated by the <chr> just to the right of the column names in the glimpse() output. This means that the variable values are character strings, such as words or phrases.

Some attributes are of class <dbl>, which stands for double precision which indicates that they are numeric and that they have decimal values. We would have prefer some of the attributes eg. householdSize to be integer values which would not allow for decimal numbers.

3.2 Some Data Cleaning

Lastly, there is a attribute of class <lgl>, which stands for logical which indicates a boolean value of true or false. Instead of a true/false data, we would prefer to have a data value of yes/no to whether the participants have kids for an easier reading of the charts in our subsequent exercise. Therefore, a new column (AnyKids) is added using the function mutate() and case_when() with the code below:

demographic_data<-demographic_data %>%
  mutate(AnyKids = case_when(haveKids==TRUE ~ "Yes", haveKids==FALSE ~ "No"))

There is another common data class, factor, which is abbreviated as <fct>. A factor is something that has unique levels but there is no appreciable order to the levels. For example, here participantId has numeric value when it is just an id for each participant; we want it to be interpreted as just a unique level and not as the number.

Hence, we will convert participantId into factors by using the across() function of the dplyr package and the as.factor() base function. The across() function has two main arguments: (i) the columns we want to operate on and (ii) the function or list of functions to apply to each column.

library(magrittr)
demographic_data <-demographic_data %>%
  mutate(across(c(participantId), as.factor)) 

#let's confirm the change
glimpse(demographic_data)

Rows: 1,011
Columns: 8
$ participantId  <fct> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,~
$ householdSize  <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ~
$ haveKids       <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU~
$ age            <dbl> 36, 25, 35, 21, 43, 32, 26, 27, 20, 35, 48, 2~
$ educationLevel <chr> "HighSchoolOrCollege", "HighSchoolOrCollege",~
$ interestGroup  <chr> "H", "B", "A", "I", "H", "D", "I", "A", "G", ~
$ joviality      <dbl> 0.001626703, 0.328086500, 0.393469590, 0.1380~
$ AnyKids        <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Ye~

We can see that participantId is now factor as indicated by after the attribute name.

3.3 Summary Statistics

The skim() function of the skimr package is also helpful for getting a general sense of our data. By design, it provides summary statistics about the attributes in the dataset.

library(skimr)
skim(demographic_data)

Table 1: Data summary
Name	demographic_data
Number of rows	1011
Number of columns	8
_______________________
Column type frequency:
character	3
factor	1
logical	1
numeric	3
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
educationLevel	1	3	19	4
interestGroup	1	1	1	10
AnyKids	1	2	3	2

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
participantId	0	1	FALSE	1011	0: 1, 1: 1, 2: 1, 3: 1

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
haveKids	0	1	0.3	FAL: 710, TRU: 301

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
householdSize	1	1.96	0.79	1	1.00	2.00	3.00	3	▇▁▇▁▆
age	1	39.07	12.38	18	29.00	39.00	50.00	60	▇▇▇▇▇
joviality	1	0.49	0.29	0	0.24	0.48	0.75	1	▇▇▇▇▇

The function provides separate tables of summary statistics for each data type: factor, numeric and logical. There is a column called n_missing for the number of values that are missing. We confirmed that there are no missing data in our dataset.

These participants are all adults ranging from a minimum age of 18 to a maximum of 60 years old, with a mean age of 39 years old which is considerably young. Since the participants need to use and navigate the city’s urban planning app for data collection, perhaps that is why we see only participants from this age range who are supposedly more tech savvy and accustomed to app usage.

4. Visualizing with Charts

In this section, we use various charts to understand the demographic of these participants from the city of Engagement, Ohio USA.

4.1 Age

A chart for the age distribution of these participants are prepared with the following changes to the ggplot():

use the bins argument to change the number of bins to 10. We reduced the bins to 10 since the age range of 18 to 60 is relatively small.
use fill argument is to shade the histogram with light blue color
use geom_vline to add a mean line of 39.07 as per summary statistics
use xlim to limit the age range as per actual dataset
use color argument is to change the outline colour of the bars in black
use theme to adjust the title position and size

ggplot(data=demographic_data, aes(x = age)) +
  geom_histogram(bins=10,color="black",fill="skyblue3")+
  geom_vline(aes(xintercept=mean(age,na.rm=T)),color="red",linetype="dashed",size=1)+
  xlab("Age") + 
  ylab("No. of Participants")+
  xlim(18,60)+
  ggtitle(label = 'Population Across Age Group',
          subtitle = '1011 participants are grouped in 10 bins according to their age.')+
    theme(plot.title = element_text(size=14, face="bold",hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5))

We see that participants are quite well distributed across all the age groups with lesser representative in the late 30s range.

4.2 Household Size

Next, we want to see the household size distribution of these participants.

ggplot(data=demographic_data, aes(x = householdSize)) +
  geom_bar(color="black",fill="skyblue3") + 
  geom_text(stat="count",aes(label=paste0(..count..,", ",round(..count../sum(..count..)*100,1),"%")),vjust=-1) +
  ylim(0,400) +
  xlab("Household Size") + 
  ylab("No. of Participants")+
  ggtitle('Household Size Distribution')+
    theme(plot.title = element_text(size=14, face="bold",hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5))

The household sizes are relatively small ranging from 1 to 3 with almost equal portion across the 3 groups.

Let’s take a look at the age distribution within each househould sizes.

ggplot(data=demographic_data, aes(x = age)) +
  geom_histogram(color="black",fill="skyblue3", bins=10) + 
  xlim(18,60)+
  xlab("Age") + 
  ylab("No. of Participants")+
  ggtitle('Age Distribution within Each Household Size')+
    theme(plot.title = element_text(size=14, face="bold",hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5))+
  facet_grid(~ householdSize)

The singles (i.e. househould size = 1) are mostly in their late 40s, and we see mostly young families (late 20s to early 30s) among the participants.

4.3 Education Level

We will use factor() to specify the factor level order from lowest to the highest education level for a more intuitive interpretation of the chart.

demographic_data$educationLevel = factor(demographic_data$educationLevel, levels = c('Low', 'HighSchoolOrCollege', 'Bachelors','Graduate'))

ggplot(data=demographic_data,aes(educationLevel)) +
  geom_bar(color="black",fill="skyblue3") + 
  geom_text(stat="count",aes(label=paste0(..count..,", ",round(..count../sum(..count..)*100,1),"%")),vjust=-1) +
  ylim (0, 600) +
  xlab("Education Level") + 
  ylab("No. of Participants") +
    ggtitle(label = 'Education Level of Participants',
          subtitle = 'The participants are grouped according to their education level.')+
    theme(plot.title = element_text(size=14, face="bold",hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5))

We see that more than 50% of the participants is a high school or college graduate, with about 40% of them holding a min. bachelors degree.

It is commonly viewed that higher educated females tends to have lesser kids. While we do not have the gender of these participants, the assumption is that married couple are usually of similar educational background and hence viewed as an unit i.e. higher educated parents tends to have lesser kids.

With this in mind, we layered the AnyKids attributes onto each education level category.

ggplot(data=demographic_data,aes(x=educationLevel, fill=AnyKids)) +
  geom_bar(color="black", position = "dodge") + 
  ylim (0, 400) +
  xlab("Education Level") + 
  ylab("No. of Participants") +
    ggtitle(label = 'Proportion of Participants with Kids for Each Education Category',
          subtitle = 'The participants are grouped according to their education level and if they have any kids.')+
    theme(plot.title = element_text(size=14, face="bold",hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5))

When we stacked the bars side by side using position = "dodge" within each category, it is apparent that parents of lower educational background do have more kids (i.e. almost half of them have kids) compared to those of higher educational background where only about one-third of them have kids.

4.4 Joviality

Now, let us see what are the relationship of the joviality of the participants with the various attributes. We prepared four different charts to compare joviality with whether they have kids, their education level, age and interest group. We compare the charts together using the ~ library.

library(patchwork)

p1 <- ggplot(data=demographic_data,aes(x=AnyKids, y=joviality)) + 
  geom_boxplot(color="black",fill="skyblue3")+
  xlab("Any Kids") + 
  ylab("Joviality") +
  ggtitle('Kids & Joviality')+
  theme(plot.title = element_text(size=12, face="bold",hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5))

p2 <- ggplot(data=demographic_data) + 
  geom_boxplot(aes(educationLevel, joviality), 
           color="black",fill="skyblue3") +
  xlab("Education Level") + 
  ylab("Joviality") +
  theme(plot.title = element_text(size=12, face="bold",hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5))+
  ggtitle('Education Level & Joviality')

p3 <- ggplot(data=demographic_data) + 
  geom_boxplot(aes(interestGroup, joviality), 
           color="black",fill="skyblue3")+
  xlab("Interest Group") + 
  ylab("Joviality") +
  theme(plot.title = element_text(size=12, face="bold",hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5))+
  ggtitle('Interest Group & Joviality')
  
p4 <- ggplot(data=demographic_data,
  aes(x=age, y=joviality)) +
  geom_point(stat = "summary", fun = "median") + 
  xlab("Age") + 
  ylab("Joviality") +
  theme(plot.title = element_text(size=12, face="bold",hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5))+
  ggtitle('Age & Joviality')

(p1 /p4) | (p2 / p3)

In general, it seems that there are not much distinct difference in the degree of joviality across all attributes i.e. everyone is as cheerful as one another. A closer look, perhaps participants who have kids or hold a graduate degree or take part in interest group “E” are typical more jovial than their peers.

We proceed to create a trellis plot with educational level, kids, interest group to see which combination may gives a more jovial individual. For an easier comparison, an arbitrary line of 0.8 (joviality) is inserted across all panels.

ggplot(data=demographic_data) + 
  geom_bar(aes(educationLevel, joviality,fill=AnyKids), 
           color="black", position = "dodge", stat = "summary", fun = "median")+
  geom_hline(yintercept=0.8, linetype="dashed", color = "red")+
  xlab("Education Level") + 
  ylab("Joviality") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
    coord_flip()+
  ggtitle(label = 'Education Level, Kids, Interest Group Affects Joviality?',
          subtitle = 'The participants are grouped according to education level, if they have kids and which interest group they joined.')+
  theme(plot.title = element_text(size=14, face="bold",hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5))+
  facet_wrap(AnyKids~interestGroup)

We see that the most jovial participants may be the graduates with kids in interest group H, and those with bachelors degree with kids in interest Group I.

5. Concluding

We have used some simple charts in attempt to illustrate the demographic of the city of Engagement, Ohio USA using the 7 attributes of the 1011 participants.

We see that the participants ranged from young adults to pre-retirement age, mostly with high school/college qualification. Their family size are typically small of not more than 3. The singles are mostly in their late 40s, and there are more young families (late 20s to late 30s) within them.

All of them also participated in some sort of an interest group. The more jovial individuals among them may be the graduates with kids in interest group H, and those with bachelors degree with kids in interest Group I.

Take-Home Exercise 1