dplyr II
By the end of the lecture, you will be able to …
Download and open code-along-04.qmd
Load the standard packages.
select()
Use select() to choose columns in a dataframe.
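A minimal sketch of select() on a small stand-in data frame. The object toy and its values are hypothetical; the column names mirror the ones used in this lecture.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the survey data; the values are made up
toy <- tibble(
  id       = 1:3,
  year     = c(2021, 2022, 2022),
  agekdbrn = c(24, 27, NA),
  childs   = c(2, 1, 0),
  hrs1     = c(40, 52, 31)
)

# Keep only the named columns, in the order listed
toy_small <- toy |>
  select(year, agekdbrn, childs, hrs1)

names(toy_small)
```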
head() & tail()
head() shows the column names and the first few rows of your df (six by default).
tail() shows the column names and the last few rows of your df (six by default).
# A tibble: 6 × 4
  year      agekdbrn    childs    hrs1
  <dbl+lbl> <dbl+lbl>   <dbl+lbl> <dbl+lbl>
1 1972      NA(i) [iap] 0         NA(i) [iap]
2 1972      NA(i) [iap] 5         NA(i) [iap]
3 1972      NA(i) [iap] 4         NA(i) [iap]
4 1972      NA(i) [iap] 0         NA(i) [iap]
5 1972      NA(i) [iap] 2         NA(i) [iap]
6 1972      NA(i) [iap] 0         NA(i) [iap]
Use tail() to look at the last few rows of my_data
# A tibble: 6 × 4
  year      agekdbrn                              childs    hrs1
  <dbl+lbl> <dbl+lbl>                             <dbl+lbl> <dbl+lbl>
1 2024      NA(x) [not available in this release] 2         NA(i) [iap]
2 2024      NA(x) [not available in this release] 3         15
3 2024      NA(x) [not available in this release] 4         18
4 2024      NA(x) [not available in this release] 3         NA(i) [iap]
5 2024      NA(x) [not available in this release] 3         37
6 2024      NA(x) [not available in this release] 2         NA(i) [iap]
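A small sketch of head() and tail(). The my_data object here is a hypothetical 20-row stand-in, not the survey data shown above.

```r
library(tibble)

# Hypothetical 20-row data frame standing in for my_data
my_data <- tibble(
  id    = 1:20,
  value = seq(10, 200, by = 10)
)

first_rows <- head(my_data)   # first 6 rows by default
last_rows  <- tail(my_data)   # last 6 rows by default

first_rows
last_rows

# The n argument changes how many rows are returned
head(my_data, n = 3)
```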
Create a dataframe with the variables year, agekdbrn, childs, and hrs1, including only respondents from the year 2022 who have no missing values in any of these variables.
# A tibble: 6 × 4
year agekdbrn childs hrs1
<dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl>
1 2022 27 1 40
2 2022 27 1 52
3 2022 28 1 31
4 2022 24 2 40
5 2022 21 2 10
6 2022 25 7 20
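One way to build a data frame like the one above is to chain select(), filter(), and drop_na(). This is a sketch on made-up data: gss_toy and its values are hypothetical, and the real code would start from the full survey object.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical stand-in for the survey data
gss_toy <- tibble(
  year     = c(2021, 2022, 2022, 2022),
  agekdbrn = c(30, 27, NA, 24),
  childs   = c(1, 1, 0, 2),
  hrs1     = c(40, 40, 35, NA),
  other    = c("a", "b", "c", "d")
)

df_2022 <- gss_toy |>
  select(year, agekdbrn, childs, hrs1) |>  # keep the four variables
  filter(year == 2022) |>                  # keep 2022 respondents only
  drop_na()                                # drop rows with any missing value

df_2022
```

Only rows that are complete on all four variables survive drop_na(), so the order of the steps matters: selecting first keeps missing values in unrelated columns from removing rows.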
mutate()
Use mutate() to add new columns or change existing ones.
# A tibble: 6 × 5
year agekdbrn childs hrs1 annual_hrs1
<dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl>
1 2022 27 1 40 2080
2 2022 27 1 52 2704
3 2022 28 1 31 1612
4 2022 24 2 40 2080
5 2022 21 2 10 520
6 2022 25 7 20 1040
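The annual_hrs1 values above equal hrs1 multiplied by 52 weeks. A sketch using the six hrs1 values from the printed rows; the object names are hypothetical.

```r
library(dplyr)
library(tibble)

# The six hrs1 values shown in the output above
hrs_toy <- tibble(hrs1 = c(40, 52, 31, 40, 10, 20))

# Add a new column: weekly hours times 52 weeks
hrs_annual <- hrs_toy |>
  mutate(annual_hrs1 = hrs1 * 52)

hrs_annual
```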
Use mutate(), mean(), and sd() to create a column with Z scores for hrs1
# A tibble: 6 × 6
year agekdbrn childs hrs1 annual_hrs1 z_hrs1
<dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl> <dbl>
1 2022 27 1 40 2080 -0.00756
2 2022 27 1 52 2704 0.820
3 2022 28 1 31 1612 -0.628
4 2022 24 2 40 2080 -0.00756
5 2022 21 2 10 520 -2.08
6 2022 25 7 20 1040 -1.39
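A Z score is the value minus the mean, divided by the standard deviation. A sketch on six made-up rows; the Z scores here will not match the output above, which uses the mean and sd of the full 2022 sample.

```r
library(dplyr)
library(tibble)

# Hypothetical six-row example
hrs_toy <- tibble(hrs1 = c(40, 52, 31, 40, 10, 20))

# Z score: (value - mean) / standard deviation
hrs_z <- hrs_toy |>
  mutate(z_hrs1 = (hrs1 - mean(hrs1)) / sd(hrs1))

hrs_z
```

By construction, the new column has a mean of 0 and a standard deviation of 1.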
mutate() & logical variables
# A tibble: 6 × 7
year agekdbrn childs hrs1 annual_hrs1 z_hrs1 teen_parent
<dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl> <dbl> <dbl>
1 2022 27 1 40 2080 -0.00756 0
2 2022 27 1 52 2704 0.820 0
3 2022 28 1 31 1612 -0.628 0
4 2022 24 2 40 2080 -0.00756 0
5 2022 21 2 10 520 -2.08 0
6 2022 25 7 20 1040 -1.39 0
What proportion of new parents were teenagers (i.e., under 18 years old)?
Use mutate() and summarise() to determine what proportion of new parents were older than 40 years old.
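A sketch of the idea: a 0/1 indicator built from a logical test has a mean equal to the proportion of rows where the test is true. The data and object names are hypothetical, and as.numeric() is one assumed way to get the 0/1 column shown earlier.

```r
library(dplyr)
library(tibble)

# Hypothetical ages at first birth
age_toy <- tibble(agekdbrn = c(17, 27, 28, 24, 16, 42, 21, 45))

age_flags <- age_toy |>
  mutate(
    teen_parent = as.numeric(agekdbrn < 18),  # 1 if under 18, else 0
    over_40     = as.numeric(agekdbrn > 40)   # 1 if older than 40, else 0
  )

# The mean of a 0/1 variable is the proportion of 1s
props <- age_flags |>
  summarise(
    prop_teen    = mean(teen_parent),
    prop_over_40 = mean(over_40)
  )

props
```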
mutate() with case_when()
Use case_when() inside mutate() to create values based on conditions.
mutate() with case_when()
    1 kid 2 kids 3+ kids
  1   320      0       0
  2     0    497       0
  3     0      0     251
  4     0      0     101
  5     0      0      44
  6     0      0      16
  7     0      0       6
  8     0      0      15
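A cross-tabulation like the one above could come from recoding childs with case_when() and then tabulating the old variable against the new one. This is a sketch on made-up data; kids_cat and the other object names are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical numbers of children
kids_toy <- tibble(childs = c(1, 2, 3, 5, 8, 2, 1))

# Conditions are checked in order; the first match wins
kids_recoded <- kids_toy |>
  mutate(kids_cat = case_when(
    childs == 1 ~ "1 kid",
    childs == 2 ~ "2 kids",
    childs >= 3 ~ "3+ kids"
  ))

# Check the recode: each original value should land in exactly one category
table(kids_recoded$childs, kids_recoded$kids_cat)
```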
What proportion of new parents had their first child in their teens, 20s, 30s, or after age 40?
              Freq        %   % Cum.
----------- ------ -------- --------
<18             87     6.96     6.96
18–29          840    67.20    74.16
30–39          291    23.28    97.44
40+             32     2.56   100.00
Total         1250   100.00   100.00
Heads Up!
Overwriting datasets and variables can be intentional or unintentional.
Let’s make a tiny data frame to use as an example:
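A sketch of the tiny data frame, with values inferred from the output printed later in this section (x runs from 1 to 5, y holds letters). The exact original code is an assumption.

```r
library(tibble)

# Tiny example data frame (values inferred, not copied from the original code)
df <- tibble(
  x = c(1, 2, 3, 4, 5),
  y = c("a", "a", "b", "c", "c")
)

df
```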
Suppose you run the following and then you inspect df.
Will the x variable have values 1, 2, 3, 4, 5 or 2, 4, 6, 8, 10?
Do something and show me
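A sketch of this pattern, assuming the operation is doubling x (inferred from the two candidate answers in the question). The result is printed but never assigned, so df itself does not change.

```r
library(dplyr)
library(tibble)

df <- tibble(
  x = c(1, 2, 3, 4, 5),
  y = c("a", "a", "b", "c", "c")
)

# Do something and show me: the result is printed, not saved
df |>
  mutate(x = x * 2)

# df is unchanged because nothing was assigned
df$x
```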
Suppose you run the following and then you inspect df.
Will the x variable have values 1, 2, 3, 4, 5 or 2, 4, 6, 8, 10?
Do something, save result, overwriting original
# A tibble: 5 × 2
x y
<dbl> <chr>
1 2 a
2 4 a
3 6 b
4 8 c
5 10 c
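The output above is consistent with assigning the result back to df. A sketch, again assuming the operation is doubling x:

```r
library(dplyr)
library(tibble)

df <- tibble(
  x = c(1, 2, 3, 4, 5),
  y = c("a", "a", "b", "c", "c")
)

# Do something, save result, overwriting original:
# the assignment replaces df with the modified version
df <- df |>
  mutate(x = x * 2)

df$x
```

After this runs, the original values of x are gone unless df is rebuilt from its source.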
Do something, save result, overwriting original when you shouldn’t
Do something, save result, overwriting original data frame
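One hypothetical way this goes wrong: filtering and assigning back to the same name. The filter condition is made up for illustration.

```r
library(dplyr)
library(tibble)

df <- tibble(
  x = c(1, 2, 3, 4, 5),
  y = c("a", "a", "b", "c", "c")
)

# Risky: the filtered result replaces df, so the other rows are lost
df <- df |>
  filter(y == "a")

nrow(df)
```

Any later step that expects all five rows would now run on the reduced data without raising an error.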
Do something and show me
gss_all |>
  select(year, agekdbrn) |>
  filter(year == 2022) |>
  drop_na() |>
  mutate(age_groups = case_when(
    agekdbrn < 18 ~ "<18",
    agekdbrn >= 18 & agekdbrn <= 29 ~ "18–29",
    agekdbrn >= 30 & agekdbrn <= 39 ~ "30–39",
    agekdbrn >= 40 ~ "40+",
    TRUE ~ NA_character_
  )) |>
  group_by(age_groups) |>
  summarise(count = n()) |>
  # Compute proportions after summarising, so sum(count) spans all groups
  mutate(proportion = round(count / sum(count), 3))
Do something, save result, not overwriting original.
# Do something
gss_all <- gss_all |>
  mutate(age_groups = case_when(
    agekdbrn < 18 ~ "<18",
    agekdbrn >= 18 & agekdbrn <= 29 ~ "18–29",
    agekdbrn >= 30 & agekdbrn <= 39 ~ "30–39",
    agekdbrn >= 40 ~ "40+",
    TRUE ~ NA_character_
  ))
# Now show me
gss_all |>
  select(year, age_groups) |>
  filter(year == 2022) |>
  drop_na() |>
  group_by(age_groups) |>
  summarise(count = n()) |>
  # Compute proportions after summarising, so sum(count) spans all groups
  mutate(proportion = round(count / sum(count), 3))
Never overwrite your raw dataset!
Heads Up!
Always save transformed data as a new object. The raw dataset should remain untouched for reference and reproducibility.
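A sketch of this habit on made-up data. The names gss_raw_toy and gss_2022_toy are hypothetical; the point is only that the transformed result gets its own name.

```r
library(dplyr)
library(tibble)

# Hypothetical raw data; this object is never reassigned
gss_raw_toy <- tibble(
  year     = c(2021, 2022, 2022),
  agekdbrn = c(30, 17, 41)
)

# Transformed data goes into a new object
gss_2022_toy <- gss_raw_toy |>
  filter(year == 2022) |>
  mutate(teen_parent = agekdbrn < 18)

nrow(gss_raw_toy)    # raw data still has every row
names(gss_raw_toy)   # and no added columns
```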
write_rds() & read_rds()
write_rds() saves an R object to a file
read_rds() loads an R object from a .rds file
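A minimal round trip with write_rds() and read_rds() from readr. The data frame is made up, and tempfile() is used here only so the sketch does not write into your project folder; in practice you would give a path of your own.

```r
library(readr)
library(tibble)

# Hypothetical data frame to save
df_clean <- tibble(
  x = c(1, 2, 3),
  y = c("a", "b", "c")
)

# A temporary path ending in .rds
path <- tempfile(fileext = ".rds")

write_rds(df_clean, path)      # save the object to disk
df_loaded <- read_rds(path)    # load it back into a new object

all.equal(df_clean, df_loaded)
```

Unlike a CSV, an .rds file stores the R object itself, so column types come back as they were saved.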
In 2024, how did the average amount of television watched differ between married people and those who were never married? And how did it compare to people who were widowed, divorced, or separated?
How do we find out?
Use select(), filter(), mutate(), group_by(), & summarise()
Can you reproduce this table?
# A tibble: 3 × 5
mar_cat count mean median sd
<chr> <int> <dbl> <dbl> <dbl>
1 Married 892 2.8 2 2.44
2 Never married 687 3.5 2 3.78
3 Widow/div/sep 565 3.87 3 3.73
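A sketch of the general shape of such a pipeline on made-up data. The column names marital and tvhours and the text labels are assumptions for illustration; in the real data the marital variable may be stored as labelled numeric codes, so the conditions would need adapting.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical stand-in data; the values are made up
tv_toy <- tibble(
  year    = c(2024, 2024, 2024, 2024, 2024, 2024, 2022),
  marital = c("married", "never married", "widowed", "divorced",
              "separated", "married", "married"),
  tvhours = c(2, 4, 5, 3, NA, 1, 8)
)

tv_summary <- tv_toy |>
  select(year, marital, tvhours) |>
  filter(year == 2024) |>
  drop_na() |>
  mutate(mar_cat = case_when(
    marital == "married" ~ "Married",
    marital == "never married" ~ "Never married",
    marital %in% c("widowed", "divorced", "separated") ~ "Widow/div/sep"
  )) |>
  group_by(mar_cat) |>
  summarise(
    count  = n(),
    mean   = mean(tvhours),
    median = median(tvhours),
    sd     = sd(tvhours)
  )

tv_summary
```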
What’s your conclusion to our research question?