This page gives a brief introduction to dplyr
and some of the data manipulations that it can perform. We will use the gapminder
dataset to demonstrate how to select columns, filter by rows, sort the data, and create new variables.
For more about the tidyverse go to https://www.tidyverse.org/. For an intro to dplyr
check out Hadley Wickham’s video. For useful cheat sheets on dplyr
, the tidyverse, and more check https://www.rstudio.com/resources/cheatsheets/.
# Load the packages needed for this practical
library(gapminder)
library(dplyr)
R has been around since the early 1990’s and since then many developments have occurred within the language. A recent major development is the invention of the tidyverse. The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
The tidyverse packages aim to help create and work with tidy data. Tidy data is a format for storing your data sets that make it easier to analyse. Tidy data will have the following properties:
One of the main packages in the tidyverse for working with tidy data is dplyr
and we will now look at how it works.
dplyr
is an R package that provides a grammar of data manipulation. The package is part of the tidyverse and it simplifies the process of common data manipulation tasks. Using dplyr
requires learning the functions that it provides. In keeping with the package being a grammar, these functions are sometimes called verbs. The verbs/functions we will look at now are:
select()
picks variables based on their names.rename()
rename variables.filter()
picks rows based on their values.filter to get certain rowsarrange()
changes the ordering of the rows.mutate()
adds new variables that are functions of existing variables.select()
is keep
or drop
filter()
is keep if ...
or drop if ...
arrange()
is sort
or gsort
mutate()
is gen
or egen
The syntax for these verbs are nearly always they same. They require the name of the data frame we wish to manipulate, and the subsequent arguments relate to the variables we wish to work with.
For instance, if we want to select the country variable from the gapminder
data frame, we type
select(gapminder, country) # select a column
select(gapminder, -country) # deselect a column
If we want to filter the gapminder
data frame to be only observations from the year 1952 then we use filter()
filter(gapminder, year == 1952) # Select only from 1952
filter(gapminder, year == 1952 & continent == "Africa") # African countries from 1952
# Either 1952 or countries from Africa at anyyear
filter(gapminder, year == 1952 | continent == "Africa")
Operators | Description |
---|---|
== | equal to |
!= | not equal to |
> | greater than |
< | less than |
>= | greater than or equal to |
=< | less than or equal to |
x & y | x and y |
x | y | x or y |
!x | not x |
x %in% y | x is in y |
To sort the data by year we can use the arrange command
arrange(gapminder, year) # default is increasing
arrange(gapminder, desc(year)) # decreasing
The mutate()
function is used to create new variables. We mutate
the data frame by adding extra variables.
mutate(gapminder, longlifeexp = lifeExp>80)
Sometimes data will need to be aggregated for analysis.
group_by()
groups the data by one or more variablessummarise()
summarise the data based on groupingsThe group_by
verb can be used to group data. Unlike the previous verbs where they were used on their own, group_by()
only becomes useful when combined with other verbs like summarise()
. summarise()
can be used to calculate the aggregate values for the group. The n()
function counts the number of rows in each category. It is a helper function within the package dplyr
.
gapminderworld <- group_by(gapminder, year)
summarise(gapminderworld, meanLifeExp = mean(lifeExp), countries = n())
Operators | Description |
---|---|
n() | count rows in each group |
min() | minimum value |
mean() | mean |
sd() | standard deviation |
max() | maximum value |
first() | first value in each group |
last() | last value in each group |
The verbs in dplyr
allow for a wide range of manipulations that are required in data analysis. There are more verbs and they can be found by searching for dplyr
or visiting http://dplyr.tidyverse.org/. As we have seen the verbs can be used one at a time which creates a new data frame each time. Now we will look at how the verbs can be put together to create sentences using an object called a pipe.
A pipe place the object on the current line as the first argument in the next functions. This can sound a bit confusing at first but we’ll go through an example to show what this means.
In R a pipe is represented by the symbol %>%. This symbol can be created in RStudio using Ctrl+shift+m. With the vocabulary of dplyr
and more generally the %>% can be read as saying “and then.
# Piping
gapminder %>%
group_by(year) %>%
summarise(meanLifeExp = mean(lifeExp), countries = n())
# Without piping
summarise(group_by(gapminder, year), meanLifeExp = mean(lifeExp), countries = n())
This code can be read as “Take the data frame gapminder, and then group
the data by year, and then, summarise
the groups by calculating the mean life expectancy and count the number of countries in each group
The pipe operator allows even more complex data manipulations to be performed in one function call.
gapminder %>%
select(year, continent, lifeExp) %>%
filter(continent== "Asia") %>%
group_by(year) %>%
summarise(meanLifeExp = mean(lifeExp), countries = n())
This code can be read as “Take the data frame gapminder, and then select
the variables year, continent, and lifeExp, and then filter
the row where the variable continent equals Asia, and then group by
by year, and then summarise
by calculating the mean life expectancy and count the number of countries in each group”
dplyr
There are a bunch of helper functions which can be used to select columns with common features.
For instance, we can select all the columns that have a similar prefix or suffix.
gapminder %>%
select(starts_with("c")) # Select columns that start with c
gapminder %>%
select(ends_with("p")) # Select columns that end with p
There are variants of the dplyr
verbs for more complex situations. The variants have a either an _if
, _all
or _at
. These can be used when performing manipulations on more than one variables. We will look at the the variants for summarise
summarise_all
summarise_if
summarise_at
gapminder %>%
summarise_all(mean)
gapminder %>%
summarise_if(is.double, mean)
gapminder %>%
summarise_if(is.integer, mean)
gapminder %>%
summarise_at(c("lifeExp", "pop"), mean)