This page gives a brief introduction to dplyr and some of the data manipulations that it can perform. We will use the gapminder dataset to demonstrate how to select columns, filter by rows, sort the data, and create new variables.

For more about the tidyverse go to https://www.tidyverse.org/. For an intro to dplyr check out Hadley Wickham’s video. For useful cheat sheets on dplyr, the tidyverse, and more check https://www.rstudio.com/resources/cheatsheets/.

Packages

# Load the packages needed for this practical
library(gapminder)
library(dplyr)

 

1 The Tidyverse

R has been around since the early 1990’s and since then many developments have occurred within the language. A recent major development is the invention of the tidyverse. The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

Tidy data

The tidyverse packages aim to help create and work with tidy data. Tidy data is a format for storing your data sets that make it easier to analyse. Tidy data will have the following properties:

  • Each variable forms a column.
  • Each observation forms a row.
  • Each value has it’s own cell.

One of the main packages in the tidyverse for working with tidy data is dplyr and we will now look at how it works.

2 dplyr

dplyr is an R package that provides a grammar of data manipulation. The package is part of the tidyverse and it simplifies the process of common data manipulation tasks. Using dplyr requires learning the functions that it provides. In keeping with the package being a grammar, these functions are sometimes called verbs. The verbs/functions we will look at now are:

  • select() picks variables based on their names.
  • rename() rename variables.
  • filter() picks rows based on their values.filter to get certain rows
  • arrange() changes the ordering of the rows.
  • mutate() adds new variables that are functions of existing variables.

Similar Stata commands

  • select() is keep or drop
  • filter() is keep if ... or drop if ...
  • arrange() is sort or gsort
  • mutate() is gen or egen

Select (Picking columns)

The syntax for these verbs are nearly always they same. They require the name of the data frame we wish to manipulate, and the subsequent arguments relate to the variables we wish to work with.

For instance, if we want to select the country variable from the gapminder data frame, we type

select(gapminder, country) # select a column
select(gapminder, -country) # deselect a column

Filter (Picking rows)

If we want to filter the gapminder data frame to be only observations from the year 1952 then we use filter()

filter(gapminder, year == 1952) # Select only from 1952
filter(gapminder, year == 1952 & continent == "Africa") # African countries from 1952
# Either 1952 or countries from Africa at anyyear
filter(gapminder, year == 1952 | continent == "Africa") 
Logical operators in R
Operators Description
== equal to
!= not equal to
> greater than
< less than
>= greater than or equal to
=< less than or equal to
x & y x and y
x | y x or y
!x not x
x %in% y x is in y

To sort the data by year we can use the arrange command

Arrange (Sorting)

arrange(gapminder, year) # default is increasing
arrange(gapminder, desc(year)) # decreasing

Mutate (Create new variables)

The mutate() function is used to create new variables. We mutate the data frame by adding extra variables.

mutate(gapminder, longlifeexp = lifeExp>80)

Group_by and summarise (Aggregation and summary stats)

Sometimes data will need to be aggregated for analysis.

  • group_by() groups the data by one or more variables
  • summarise() summarise the data based on groupings

The group_by verb can be used to group data. Unlike the previous verbs where they were used on their own, group_by() only becomes useful when combined with other verbs like summarise(). summarise() can be used to calculate the aggregate values for the group. The n() function counts the number of rows in each category. It is a helper function within the package dplyr.

gapminderworld <- group_by(gapminder, year)
summarise(gapminderworld, meanLifeExp = mean(lifeExp), countries = n())
Useful functions for summarise
Operators Description
n() count rows in each group
min() minimum value
mean() mean
sd() standard deviation
max() maximum value
first() first value in each group
last() last value in each group

3 Piping or speaking in sentences!

The verbs in dplyr allow for a wide range of manipulations that are required in data analysis. There are more verbs and they can be found by searching for dplyr or visiting http://dplyr.tidyverse.org/. As we have seen the verbs can be used one at a time which creates a new data frame each time. Now we will look at how the verbs can be put together to create sentences using an object called a pipe.

A pipe place the object on the current line as the first argument in the next functions. This can sound a bit confusing at first but we’ll go through an example to show what this means.

In R a pipe is represented by the symbol %>%. This symbol can be created in RStudio using Ctrl+shift+m. With the vocabulary of dplyr and more generally the %>% can be read as saying “and then.

# Piping
gapminder %>% 
  group_by(year) %>% 
  summarise(meanLifeExp = mean(lifeExp), countries = n())

# Without piping
summarise(group_by(gapminder, year), meanLifeExp = mean(lifeExp), countries = n())

This code can be read as “Take the data frame gapminder, and then group the data by year, and then, summarise the groups by calculating the mean life expectancy and count the number of countries in each group

The pipe operator allows even more complex data manipulations to be performed in one function call.

gapminder %>% 
  select(year, continent, lifeExp) %>% 
  filter(continent== "Asia") %>% 
  group_by(year) %>% 
  summarise(meanLifeExp = mean(lifeExp), countries = n())

This code can be read as “Take the data frame gapminder, and then select the variables year, continent, and lifeExp, and then filter the row where the variable continent equals Asia, and then group by by year, and then summarise by calculating the mean life expectancy and count the number of countries in each group”

4 Advanced dplyr

Helper functions

There are a bunch of helper functions which can be used to select columns with common features.

  • starts_with(): starts with a prefix
  • ends_with(): ends with a prefix
  • contains(): contains a literal string
  • matches(): matches a regular expression
  • num_range(): a numerical range like x01, x02, x03.
  • one_of(): variables in character vector.
  • everything(): all variables.
  • last_col(): last variable, possibly with an offset.

For instance, we can select all the columns that have a similar prefix or suffix.

gapminder %>% 
  select(starts_with("c")) # Select columns that start with c
gapminder %>% 
  select(ends_with("p")) # Select columns that end with p

all, if, at

There are variants of the dplyr verbs for more complex situations. The variants have a either an _if, _all or _at. These can be used when performing manipulations on more than one variables. We will look at the the variants for summarise

  • _all affects every variable
  • _at affects variables selected with a character vector or vars()
  • _if affects variables selected with a predicate function:

summarise_all summarise_if summarise_at

gapminder %>% 
  summarise_all(mean)
gapminder %>% 
  summarise_if(is.double, mean)
gapminder %>% 
  summarise_if(is.integer, mean)
gapminder %>% 
  summarise_at(c("lifeExp", "pop"), mean)