Data Management I

Agenda

  • dplyr I
  • Summary Statistics

Learning objectives

By the end of the lecture, you will be able to …

  • implement some basic variable manipulation
  • produce measures of central tendency and variability

Code-along 03

Download and open code-along-03.qmd

Packages

Load the standard packages.

library(tidyverse) 
library(haven) # not core tidyverse
library(gssr)
library(gssrdoc) # load the GSS codebook
library(summarytools)

Load your data

# Get the data all survey years
data(gss_all)

Use the gss_which_years() function to identify the survey years a variable appeared.


gss_which_years(gss_all, meovrwrk)
# A tibble: 35 × 2
   year      meovrwrk
   <dbl+lbl> <lgl>   
 1 1972      FALSE   
 2 1973      FALSE   
 3 1974      FALSE   
 4 1975      FALSE   
 5 1976      FALSE   
 6 1977      FALSE   
 7 1978      FALSE   
 8 1980      FALSE   
 9 1982      FALSE   
10 1983      FALSE   
# ℹ 25 more rows


Heads Up!

If run in the console, to see all rows, wrap the code in the print() command: print(gss_which_years(gss_all, meovrwrk), n = 40)

Mini-task

Which survey years did the variable childs appear?


gss_which_years(gss_all, childs)
# A tibble: 35 × 2
   year      childs
   <dbl+lbl> <lgl> 
 1 1972      TRUE  
 2 1973      TRUE  
 3 1974      TRUE  
 4 1975      TRUE  
 5 1976      TRUE  
 6 1977      TRUE  
 7 1978      TRUE  
 8 1980      TRUE  
 9 1982      TRUE  
10 1983      TRUE  
# ℹ 25 more rows

dplyr I

Tidying Data

Most tasks related to data analysis are not glorious or fancy.


A lot of time is dedicated to whipping a dataset into the shape needed to be able to analyze it.


This task has different names “data cleaning,” “data management,” “data manipulation,” “data wrangling,” “data transformation.”

dplyr package

The dplyr package provides a complete set of functions that help you solve the most common data manipulation challenges:

  • filtering observations based on their values
  • extracting observations based on their values or positions
  • sampling observations based on a specific number or fraction of rows
  • sorting observations based on one or several variables
  • selecting variables based on their names or positions
  • renaming variables
  • adding new variables based on existing ones
  • summarizing observations or variables to a single descriptive measure
  • performing any operation by grouping

function(argument)

Functions are (most often) verbs, followed by what they will be applied to in parentheses:


do_this(to_this)
do_that(to_this, to_that, with_those)


dplyr verbs (functions) will allow you to solve the vast majority of your data manipulation challenges.

dplyr basics

They are organized into four groups based on what they operate on: rows, columns, groups, or tables.


The verbs all have in common:

  1. The first argument is always a data frame.
  2. The subsequent arguments typically describe which columns to operate on using the variable names (without quotes).
  3. The output is always a new data frame.

The pipe |>

The pipe operator passes what comes before it into the function that comes after it as the first argument.


sum(1, 2)
[1] 3


1 |> 
  sum(2)
[1] 3

dplyr() in action

Create a frequency table for childs variable.

gss_all$childs <- zap_missing(gss_all$childs) 
gss_all$childs <- as_factor(gss_all$childs)   
gss_all$childs <- droplevels(gss_all$childs)  

gss_all |>
  freq(childs, report.nas = FALSE) |>
  tb()
1
Use dplyr grammar, starting with the name of the df and a pipe
2
Use the freq() function as usual
3
Add the tb() function to turn the table into a tibble

dplyr() in action

# A tibble: 9 × 4
  childs     freq    pct pct_cum
  <fct>     <dbl>  <dbl>   <dbl>
1 0         20956 27.8      27.8
2 1         11979 15.9      43.7
3 2         18989 25.2      68.9
4 3         11671 15.5      84.3
5 4          5996  7.95     92.3
6 5          2669  3.54     95.8
7 6          1381  1.83     97.7
8 7           734  0.973    98.6
9 8 or more  1032  1.37    100  

Mini-task

Use a |> & gss_which_years() to find out which survey years the variable agekdbrn appears.

gss_all |>
  gss_which_years(agekdbrn)
# A tibble: 35 × 2
   year      agekdbrn
   <dbl+lbl> <lgl>   
 1 1972      FALSE   
 2 1973      FALSE   
 3 1974      FALSE   
 4 1975      FALSE   
 5 1976      FALSE   
 6 1977      FALSE   
 7 1978      FALSE   
 8 1980      FALSE   
 9 1982      FALSE   
10 1983      FALSE   
# ℹ 25 more rows

dplyr style

In data transformation pipelines, always use a

  • space before |>
  • line break after |>
  • indent the next line of code

We’ll talk about data visualization pipes later…


Heads Up!

|> (native pipe operator) and %>% (magrittr package) behave identically for simple cases. More info.

dplyr grammar

What’s the advantage of dplyr grammar? We can sequence data manipulation!


gss_all |> 
  filter(year == 2022) |>
  drop_na(sex, agekdbrn) |>
  group_by(sex) |>
  summarize(avg = mean(agekdbrn))
# A tibble: 2 × 2
  sex          avg
  <dbl+lbl>  <dbl>
1 1 [male]    26.3
2 2 [female]  23.7

filter() & drop_na()

Use filter() to keep rows that meet a condition.
Use drop_na() to remove rows with missing (NA) values.


gss_all |>
  filter(year == 2022) |>
  drop_na(sex, agekdbrn)|>
  count(sex)
1
Start with the gss_all data frame:
2
Keep only the respondents from the 2022 survey
3
Remove any observations with missing data for our key variables
4
Tally how many complete entries remain for each gender
# A tibble: 2 × 2
  sex            n
  <dbl+lbl>  <int>
1 1 [male]    1031
2 2 [female]  1363

Mini-task

Limit the dataframe to only the 2024 respondents without missing data for sex, hrs1, and childs. How many men and women are left?

gss_all |> 
  filter(year == 2024) |> 
  drop_na(sex, hrs1, childs) |>
  count(sex)
# A tibble: 2 × 2
  sex            n
  <dbl+lbl>  <int>
1 1 [male]     859
2 2 [female]   888

Summary Statistics

Median & Mean w base R

# Using base R
median(gss_all$agekdbrn, na.rm=TRUE)
1
na.rm is a logical evaluating to TRUE or FALSE indicating whether NA values should be stripped before the computation proceeds.
[1] 23


mean(gss_all$agekdbrn, na.rm=TRUE)
[1] 24.14491


# show me summary statistics
summary(gss_all$agekdbrn)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   9.00   20.00   23.00   24.14   28.00   97.00   46832 

Mini-task

Find the mean and median for the variable: hrs1.

# separately
median(gss_all$hrs1, na.rm=TRUE)
mean(gss_all$hrs1, na.rm=TRUE)

# together
summary(gss_all$hrs1)
[1] 40
[1] 41.11279
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00   37.00   40.00   41.11   48.00   89.00   32371 

summarize()

Use functions inside summarize() to create summary variables.


gss_all |>  
  drop_na(agekdbrn) |>
   summarize(
    avg = mean(agekdbrn)
    ) 
# A tibble: 1 × 1
    avg
  <dbl>
1  24.1

Median & Mean with dplyr

Use dplyr grammar to find the frequency, mean, and median.


Add lots of summary variables using functions inside summarize().

gss_all |>  
  drop_na(agekdbrn) |> 
  summarize(
    freq = n(),
    med = median(agekdbrn),
    avg = mean(agekdbrn)
    ) 
1
Use n() function to create a count variable (no variable in parentheses)
2
Use median() function on a variable
3
Use mean() function on a variable
# A tibble: 1 × 3
   freq   med   avg
  <int> <dbl> <dbl>
1 28867    23  24.1

Mini-task

Use dplyr grammar to find the frequency, mean, and median for the variable: hrs1.

gss_all |>  
  drop_na(hrs1) |>
   summarize(
    freq = n(),
    med = median(hrs1),
    avg = mean(hrs1)
    ) 
# A tibble: 1 × 3
   freq   med   avg
  <int> <dbl> <dbl>
1 43328    40  41.1

group_by() and summarize()

Use group_by() to organize your data into groups based on one or more variables.
Use summarize() to compute statistics like total, mean, or median for each group.

gss_all |>
  drop_na(sex, agekdbrn) |>
  group_by(sex) |>
  summarize(avg = mean(agekdbrn))
1
Start with the gss_all data frame:
2
Remove any observations with missing data for our key variables
3
Do the next steps separately for each group in the variable
4
Creates a new data frame with one row for each combination of grouping variables
# A tibble: 2 × 2
  sex          avg
  <dbl+lbl>  <dbl>
1 1 [male]    25.7
2 2 [female]  23.0

Mini-task

Compare the median and average work hours (hrs1) for U.S. men and women.

gss_all |>  
  drop_na(sex, hrs1) |> 
  group_by(sex) |> 
  summarize(
    med = median(hrs1),
    avg = mean(hrs1)
    ) 
# A tibble: 2 × 3
  sex          med   avg
  <dbl+lbl>  <dbl> <dbl>
1 1 [male]      40  44.3
2 2 [female]    40  37.8

summarize() + variability measures

gss_all |>  
  drop_na(agekdbrn, sex)  |>
  group_by(as_factor(sex)) |>  
  summarize(
    freq = n(),
    min = min(agekdbrn),
    median = median(agekdbrn),
    max = max(agekdbrn),
    mean = round(mean(agekdbrn), digits = 2), 
    sd = sd(agekdbrn)
    ) 
# A tibble: 2 × 7
  `as_factor(sex)`  freq min       median max        mean    sd
  <fct>            <int> <dbl+lbl>  <dbl> <dbl+lbl> <dbl> <dbl>
1 male             11790 9             25 65         25.7  5.85
2 female           17009 9             22 57         23.0  5.30

Think Like a Statistician

How did the mean and standard deviation of weekly work hours differ between men and women in 2024, and what do these differences suggest about gender-based variability in labor patterns?


How do we find out?

Think Like a Statistician

Can you produce this table?

# A tibble: 2 × 3
  `as_factor(sex)`  mean    sd
  <fct>            <dbl> <dbl>
1 male              41.7  13.7
2 female            37.3  13.7

Your Data Take


What’s your conclusion to our research question?