dplyr I
By the end of the lecture, you will be able to …
Download and open code-along-03.qmd
Load the standard packages.
Use the gss_which_years() function to identify the survey years in which a variable appeared.
# A tibble: 35 × 2
   year      meovrwrk
   <dbl+lbl> <lgl>
 1 1972      FALSE
 2 1973      FALSE
 3 1974      FALSE
 4 1975      FALSE
 5 1976      FALSE
 6 1977      FALSE
 7 1978      FALSE
 8 1980      FALSE
 9 1982      FALSE
10 1983      FALSE
# ℹ 25 more rows
Heads Up!
To see all rows when running the code in the console, wrap it in the print() function: print(gss_which_years(gss_all, meovrwrk), n = 40)
In which survey years did the variable childs appear?
Most tasks related to data analysis are not glorious or fancy.
A lot of time is dedicated to whipping a dataset into the shape needed to be able to analyze it.
This task goes by different names: “data cleaning,” “data management,” “data manipulation,” “data wrangling,” and “data transformation.”
dplyr package
The dplyr package provides a complete set of functions that help you solve the most common data manipulation challenges.
function(argument)
Functions are (most often) verbs, followed by what they will be applied to in parentheses.
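As a small illustration of the verb-then-parentheses pattern, here is a sketch using base R functions only. The vector hours and its values are made up for this example.

```r
# A made-up vector of weekly work hours (illustrative values only)
hours <- c(40, 35, 50, 20)

# Verb first, then what it applies to in parentheses
total <- sum(hours)      # add up the values
n_obs <- length(hours)   # count the values

# Additional arguments after the first one adjust how the verb behaves
avg <- round(total / n_obs, digits = 1)
avg
```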
dplyr verbs (functions) will allow you to solve the vast majority of your data manipulation challenges.
dplyr basics
The verbs are organized into four groups based on what they operate on: rows, columns, groups, or tables.
The verbs all have in common:
|>
The pipe operator passes what comes before it into the function that comes after it as the first argument.
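To make the first-argument rule concrete, the sketch below writes the same computation twice, once nested and once piped. It assumes R 4.1 or later (where the native pipe is available); the vector hours is made up for this example.

```r
# A made-up vector of weekly work hours (illustrative values only)
hours <- c(40, 35, 50, 20)

# Nested calls: read from the inside out
nested <- round(mean(hours), digits = 1)

# Piped calls: read top to bottom; each result becomes the
# first argument of the next function
piped <- hours |>
  mean() |>
  round(digits = 1)

# Both versions compute the same value
identical(nested, piped)
```

The piped version lists the steps in the order they happen, which is why it scales better as more steps are added.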
dplyr() in action
Create a frequency table for the childs variable.
gss_all$childs <- zap_missing(gss_all$childs)
gss_all$childs <- as_factor(gss_all$childs)
gss_all$childs <- droplevels(gss_all$childs)
gss_all |>                              # dplyr grammar, starting with the name of the df and a pipe
  freq(childs, report.nas = FALSE) |>   # freq() function as usual
  tb()                                  # tb() function to turn the table into a tibble
dplyr() in action
# A tibble: 9 × 4
  childs     freq    pct pct_cum
  <fct>     <dbl>  <dbl>   <dbl>
1 0         20956 27.8      27.8
2 1         11979 15.9      43.7
3 2         18989 25.2      68.9
4 3         11671 15.5      84.3
5 4          5996  7.95     92.3
6 5          2669  3.54     95.8
7 6          1381  1.83     97.7
8 7           734  0.973    98.6
9 8 or more  1032  1.37    100
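For comparison, a frequency table with the same column layout can also be sketched with dplyr verbs alone. This is an alternative to freq() and tb(), not the approach used in the lecture, and the data frame toy is made up so the example runs without the GSS data.

```r
library(dplyr)

# Made-up data standing in for gss_all (illustrative values only)
toy <- tibble(childs = c(0, 2, 2, 1, 0, 2, 3, 0))

freq_tbl <- toy |>
  count(childs, name = "freq") |>      # one row per value, with its count
  mutate(
    pct     = 100 * freq / sum(freq),  # share of all rows, in percent
    pct_cum = cumsum(pct)              # running total of the shares
  )

freq_tbl
```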
Use the pipe (|>) and gss_which_years() to find out in which survey years the variable agekdbrn appears.
dplyr style
In data transformation pipelines, always use a
space before |>
line break after |>
We’ll talk about data visualization pipes later…
Heads Up!
|> (native pipe operator) and %>% (magrittr package) behave identically for simple cases.
dplyr grammar
What’s the advantage of dplyr grammar? We can sequence data manipulation!
filter() & drop_na()
Use filter() to keep rows that meet a condition.
Use drop_na() to remove rows with missing (NA) values.
gss_all data frame:
# A tibble: 2 × 2
  sex            n
  <dbl+lbl>  <int>
1 1 [male]    1031
2 2 [female]  1363
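A minimal sketch of how filter() and drop_na() can be chained, assuming the dplyr and tidyr packages are installed. The data frame toy and its values are made up; only the column names echo the GSS variables.

```r
library(dplyr)
library(tidyr)

# Made-up respondents (illustrative values only, not GSS data)
toy <- tibble(
  year = c(2022, 2024, 2024, 2024, 2024),
  sex  = c("male", "female", "male", NA, "male"),
  hrs1 = c(40, 38, NA, 45, 20)
)

kept <- toy |>
  filter(year == 2024) |>   # keep only rows where the condition is TRUE
  drop_na(sex, hrs1)        # drop rows with NA in any of the listed columns

kept |>
  count(sex)                # count the remaining rows in each category
```

Here the 2022 row fails the filter() condition, and the two rows with a missing sex or hrs1 value are removed by drop_na(), leaving two rows.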
Limit the dataframe to only the 2024 respondents without missing data for sex, hrs1, and childs. How many men and women are left?
na.rm is a logical evaluating to TRUE or FALSE indicating whether NA values should be stripped before the computation proceeds.
[1] 23
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   9.00   20.00   23.00   24.14   28.00   97.00   46832
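The effect of na.rm is easiest to see on a tiny vector. The sketch below uses base R only; the vector ages is made up for this example.

```r
# Made-up ages with one missing value (illustrative values only)
ages <- c(19, 23, NA, 30)

with_na    <- mean(ages)                  # NA: one unknown value makes the mean unknown
without_na <- mean(ages, na.rm = TRUE)    # NA is stripped first, so 3 values are averaged
mid_value  <- median(ages, na.rm = TRUE)  # the same argument works for median()

with_na
without_na
mid_value
```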
Find the mean and median for the variable: hrs1.
summarize()
Use functions inside summarize() to create summary variables.
dplyr
Use dplyr grammar to find the frequency, mean, and median.
Add lots of summary variables using functions inside summarize().
n() function to create a count variable (no variable in parentheses)
median() function on a variable
mean() function on a variable
# A tibble: 1 × 3
   freq   med   avg
  <int> <dbl> <dbl>
1 28867    23  24.1
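One way the three summary variables could be built is sketched below on made-up data. The column names freq, med, and avg mirror the output shown above; the data frame toy and the use of drop_na() to handle missing values are assumptions of this sketch.

```r
library(dplyr)
library(tidyr)

# Made-up data standing in for gss_all (illustrative values only)
toy <- tibble(agekdbrn = c(19, 23, NA, 30, 25))

stats <- toy |>
  drop_na(agekdbrn) |>        # remove missing values before summarizing
  summarize(
    freq = n(),               # count of rows (nothing inside the parentheses)
    med  = median(agekdbrn),  # median of the variable
    avg  = mean(agekdbrn)     # mean of the variable
  )

stats
```

Because the missing row is dropped first, n() counts only the respondents who actually contribute to the median and mean.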
Use dplyr grammar to find the frequency, mean, and median for the variable: hrs1.
group_by() and summarize()
Use group_by() to organize your data into groups based on one or more variables.
Use summarize() to compute statistics like total, mean, or median for each group.
gss_all data frame:
# A tibble: 2 × 2
  sex          avg
  <dbl+lbl>  <dbl>
1 1 [male]    25.7
2 2 [female]  23.0
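The grouping step can be sketched on made-up data as follows. The data frame toy and its values are hypothetical; the point is only that summarize() returns one row per group once group_by() has been applied.

```r
library(dplyr)

# Made-up data standing in for gss_all (illustrative values only)
toy <- tibble(
  sex      = c("male", "male", "female", "female", "female"),
  agekdbrn = c(24, 28, 20, 22, 27)
)

by_sex <- toy |>
  group_by(sex) |>                  # split the rows into one group per value of sex
  summarize(avg = mean(agekdbrn))   # compute the statistic once per group

by_sex
```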
Compare the median and average work hours (hrs1) for U.S. men and women.
summarize() + variability measures
# A tibble: 2 × 7
  `as_factor(sex)`  freq min       median max        mean    sd
  <fct>            <int> <dbl+lbl>  <dbl> <dbl+lbl> <dbl> <dbl>
1 male             11790 9             25 65         25.7  5.85
2 female           17009 9             22 57         23.0  5.30
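A sketch of how center and spread can be summarized side by side, using made-up work hours. The data frame toy is hypothetical and was chosen so that both groups share the same mean but differ in standard deviation.

```r
library(dplyr)

# Made-up weekly work hours (illustrative values only, not GSS data)
toy <- tibble(
  sex  = c("male", "male", "male", "female", "female", "female"),
  hrs1 = c(30, 40, 50, 38, 40, 42)
)

spread <- toy |>
  group_by(sex) |>
  summarize(
    freq   = n(),           # group size
    min    = min(hrs1),     # smallest value
    median = median(hrs1),  # middle value
    max    = max(hrs1),     # largest value
    mean   = mean(hrs1),    # average
    sd     = sd(hrs1)       # standard deviation: typical distance from the mean
  )

spread
```

In this toy data both groups average 40 hours, yet the standard deviations differ, which shows why a mean alone can hide differences in variability.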
How did the mean and standard deviation of weekly work hours differ between men and women in 2024, and what do these differences suggest about gender-based variability in labor patterns?
How do we find out?
Can you produce this table?
# A tibble: 2 × 3
  `as_factor(sex)`  mean    sd
  <fct>            <dbl> <dbl>
1 male              41.7  13.7
2 female            37.3  13.7
What’s your conclusion to our research question?