Working with Incomplete Data

Political Science, University of South Carolina

Published: 2/6/23

Attention!

Here, I assume:

  1. that you already know how to import data into R.
  2. that you already know how to use the tidyverse workflow with pipes, either %>% from magrittr or the native pipe (|>). I will use native pipes.

The goal of today’s class is to identify:

  • what we need to merge different data sets

  • what potential problems can arise when we merge data sets

Load Packages
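The code in this tutorial assumes the following packages are loaded; this is a minimal setup (install them first if needed):

library(tidyverse)   # readr, dplyr, tidyr, ggplot2, stringr, and friends
library(countrycode) # provides codelist, used in the dyads section below
# {broom} is installed and loaded later, in the bind_cols() example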

Example Data in Use

We will use two example data sets, drawn from V-Dem and the World Development Indicators. Both are available for download here: example1 and example2. Because they are in CSV format, use read_csv() to import them into R. I put them in a folder called examples. We will also use other data sets that are built into R.

example1 <- read_csv("examples/example1.csv") |> 
  dplyr::select(COWcode, country_name, year, v2x_polyarchy, contains("e_"))
example2 <- read_csv("examples/example2.csv") |> 
  dplyr::select(ccodecow, year, contains("wdi_"))

Let’s look at the data set structures:

glimpse(example1)
Rows: 3,922
Columns: 7
$ COWcode       <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, …
$ country_name  <chr> "Mexico", "Mexico", "Mexico", "Mexico", "Mexico", "Mexic…
$ year          <dbl> 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 20…
$ v2x_polyarchy <dbl> 0.684, 0.694, 0.702, 0.710, 0.710, 0.675, 0.644, 0.643, …
$ e_polity2     <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, NA…
$ e_fh_cl       <dbl> 3, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
$ e_fh_pr       <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…

example1 is drawn from the V-Dem data set. The glimpse() function is the tidyverse counterpart of the base R function str(); it shows the structure of a data frame. example1 has 3,922 observations (rows) and 7 variables (columns): two id variables (country_name, COWcode) and one year variable (year). v2x_polyarchy is the “Electoral Democracy Index (EDI),” which measures procedural democracy based on the suggestions of Dahl (1971). e_polity2 is an alternative measure of democracy from the Polity project, while e_fh_cl and e_fh_pr are drawn from the “Freedom House Index” and represent civil liberties and political rights, respectively.

glimpse(example2)
Rows: 4,220
Columns: 5
$ ccodecow             <dbl> 700, 700, 700, 700, 700, 700, 700, 700, 700, 700,…
$ year                 <dbl> 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2…
$ wdi_gdpcappppcon2017 <dbl> NA, 1189.785, 1235.810, 1200.278, 1286.794, 1315.…
$ wdi_pop1564          <dbl> 48.90547, 49.09875, 49.34294, 49.61243, 49.91287,…
$ wdi_trade            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

example2 is from the World Bank’s World Development Indicators. When we look at the data, we see 4,220 observations (rows) and 5 variables (columns): a country-id variable (ccodecow), a year variable (year), and three substantive variables. wdi_gdpcappppcon2017 is GDP per capita, PPP (constant 2017 US dollars), that is, gross domestic product converted to US dollars using purchasing power parity rates; wdi_pop1564 is the share of the population aged 15-64; and wdi_trade is trade openness, the sum of exports and imports of goods and services measured as a share of GDP.

Because I filtered both example data sets to start in 2001, example1 covers 2001-2022 and example2 covers 2001-2020.

Incomplete Data: What Does It Mean?

What does the term “incomplete data” mean? Because survey data are designed to answer specific questions, the information they provide is usually sufficient for our research questions. Also, because the units differ, it is difficult to merge survey data with other data sets unless we keep tracking and re-surveying the same respondents.

When we use non-survey observational data, we may run into an incomplete data problem, which means that a single data set is insufficient to answer our questions. To solve this problem, we must combine multiple incomplete data sets to make them complete.

The process of combining multiple data sets into a single data set is known as merging. Adding inflation factors to panel data to adjust income to today’s rates is one example, as is adding county-level statistics to individual-level data. We’ll look at two kinds of merges: adding columns from one data set to another and adding rows (also known as appending).

Adding rows

The actual act of merging two data sets usually involves only one or two lines of code. Most of the work when merging happens before this, in preparing data frames for merging and in deciding which rows we want in the output.

Let’s create two data frames from example1: example1_1 and example1_2.

example1_1 <- example1 |> 
  dplyr::filter(country_name %in% c("Mexico", "South Korea"))
example1_2 <- example1 |> 
  dplyr::filter(country_name %in% c("United States of America", "South Korea"))

To merge two data frames, we need keys: identifiers that we can use to match one set of information to another. A common example of a key in the social sciences is a person ID, which can be used to link information across multiple data sets. However, example1 is not individual-level data, so the keys will be country and year identifiers: country_name, COWcode, and year.

head(example1_1)
glimpse(example1_1)
Rows: 44
Columns: 7
$ COWcode       <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, …
$ country_name  <chr> "Mexico", "Mexico", "Mexico", "Mexico", "Mexico", "Mexic…
$ year          <dbl> 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 20…
$ v2x_polyarchy <dbl> 0.684, 0.694, 0.702, 0.710, 0.710, 0.675, 0.644, 0.643, …
$ e_polity2     <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, NA…
$ e_fh_cl       <dbl> 3, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
$ e_fh_pr       <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
head(example1_2)
glimpse(example1_2)
Rows: 44
Columns: 7
$ COWcode       <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
$ country_name  <chr> "United States of America", "United States of America", …
$ year          <dbl> 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 20…
$ v2x_polyarchy <dbl> 0.824, 0.827, 0.862, 0.862, 0.857, 0.863, 0.898, 0.900, …
$ e_polity2     <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, …
$ e_fh_cl       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ e_fh_pr       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,…

What we want to accomplish with our merge is to collect them all into a single data set. The question we have to answer first is: which rows do we want to appear in our merged data set? Currently, both data sets share the variables country_name, COWcode, and year. Beyond that, one has data for Mexico and South Korea, the other for the United States of America and South Korea.

Now you know that the two data sets share the same keys and columns. When data sets share the same columns, it is easy to stack them into one data frame: use bind_rows(). (Note that because both data frames contain South Korea, its rows appear twice in the result; dplyr::distinct() can drop such duplicates if they are unwanted.)

bind_rows(example1_1, example1_2)
bind_rows(example1_1, example1_2) |> glimpse()
Rows: 88
Columns: 7
$ COWcode       <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, …
$ country_name  <chr> "Mexico", "Mexico", "Mexico", "Mexico", "Mexico", "Mexic…
$ year          <dbl> 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 20…
$ v2x_polyarchy <dbl> 0.684, 0.694, 0.702, 0.710, 0.710, 0.675, 0.644, 0.643, …
$ e_polity2     <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, NA…
$ e_fh_cl       <dbl> 3, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
$ e_fh_pr       <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…

When you supply a column name with the .id argument, a new column is created to link each row to its original data frame:

bind_rows(list(example1_1, example1_2), .id = "id")
bind_rows("group 1" = example1_1, "group 2" = example1_2, .id = "groups")

Columns don’t need to match when row-binding.

bind_rows(tibble(x = 1:3), tibble(y = 1:4))

Row sizes must be compatible when column-binding; otherwise you will see an error message.

try(bind_cols(tibble(x = 1:3), tibble(y = 1:2)))
Error in bind_cols(tibble(x = 1:3), tibble(y = 1:2)) : 
  Can't recycle `..1` (size 3) to match `..2` (size 2).

Also, bind_rows() is picky about matching column types: it will not automatically coerce columns to a common type. Instead, it returns an error. We can fix this by manually coercing the mismatched column to a common type (here, character) before merging.

survey1 <-
  data.frame(Q1 = c(1, 1),
             Q2 = c(3, 1),
             Q2a = c("Sometimes", NA),
             Q3 = c(6, 7),
             Q3a = c("Never", "Always"))
glimpse(survey1)
Rows: 2
Columns: 5
$ Q1  <dbl> 1, 1
$ Q2  <dbl> 3, 1
$ Q2a <chr> "Sometimes", NA
$ Q3  <dbl> 6, 7
$ Q3a <chr> "Never", "Always"
survey2 <-
  data.frame(Q1 = c(2, 4, 4),
             Q2 = c(1, 5, 3),
             Q2a = c(NA, NA, 2),
             Q3 = c(4, 2, 5),
             Q3a = c(NA, NA, NA))
glimpse(survey2)
Rows: 3
Columns: 5
$ Q1  <dbl> 2, 4, 4
$ Q2  <dbl> 1, 5, 3
$ Q2a <dbl> NA, NA, 2
$ Q3  <dbl> 4, 2, 5
$ Q3a <lgl> NA, NA, NA

When we use bind_rows() for survey1 and survey2, the error message tells us that column Q2a is character in one data frame and numeric in the other.

bind_rows(survey1, survey2)
Error in `bind_rows()`:
! Can't combine `..1$Q2a` <character> and `..2$Q2a` <double>.

This happened here because c() automatically assigned a type based on the values it observed. When we read in data sets, read.csv() and related functions will also use the values to determine the type. Differences in responses between two data sets, especially for open-response questions, can lead to different types assigned to the two data sets’ columns.
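If you are importing the files yourself, one way to prevent the mismatch up front is to declare column types at import time. A sketch with hypothetical file paths:

# force Q2a to character in both files so the types agree
# (the paths are hypothetical; adjust them to your data)
read_csv("examples/survey1.csv", col_types = cols(Q2a = col_character()))
read_csv("examples/survey2.csv", col_types = cols(Q2a = col_character()))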

To fix this, simply coerce the problem column in one data set to match the type of that column in the other data set. Then, use bind_rows().

survey2b <- 
  survey2 |>
  mutate(Q2a = as.character(Q2a))

surveys <- bind_rows(survey1, survey2b)

surveys

bind_rows() changes the number of rows (observations) of the combined data set; the columns stay the same as long as the inputs share the same variables.
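A quick sanity check with the objects above:

nrow(survey1) + nrow(survey2) == nrow(surveys)  # TRUE: 2 + 3 = 5 rows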

When is bind_rows() useful?

bind_rows() can be useful if you have multiple data sources and want to combine them into one. Suppose you download several data sets for a country, one per year, from the same website. If they share the same columns, you can use bind_rows() to stack them and create a new variable, such as year, with the .id option.
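For instance, a minimal sketch with two made-up yearly downloads (data_2019 and data_2020 are hypothetical):

data_2019 <- tibble(country = c("Mexico", "Canada"), gdp_growth = c(2.1, 1.8))
data_2020 <- tibble(country = c("Mexico", "Canada"), gdp_growth = c(-8.0, -5.1))
# the names supplied to bind_rows() become the values of the new year column
# (year is character here; convert with as.integer() if needed)
bind_rows("2019" = data_2019, "2020" = data_2020, .id = "year")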

Adding columns

While bind_rows() is about the number of observations, bind_cols() is about the number of variables. When you have a set of shared keys and two different data sets with different variables, you can combine them into one.

names(example1_1)
[1] "COWcode"       "country_name"  "year"          "v2x_polyarchy"
[5] "e_polity2"     "e_fh_cl"       "e_fh_pr"      
names(example1_2)
[1] "COWcode"       "country_name"  "year"          "v2x_polyarchy"
[5] "e_polity2"     "e_fh_cl"       "e_fh_pr"      

As the names() output shows, example1_1 and example1_2 contain exactly the same columns, including the country and year identifiers. Suppose we naively paste the two data sets side by side:

bind_cols(example1_1, example1_2)

However, something looks wrong: the result has twice as many variables as the original data sets. What happened? bind_cols() binds any number of data frames by column, making a wider result. It requires compatible row sizes, and it matches rows by position, not by key: here Mexico’s rows end up pasted next to the United States’ rows. Because bind_cols() binds rows in the order in which they appear, it is easy to create meaningless results without realising it, so where possible prefer a join to combine multiple data frames. We will see the *_join() functions in the next section.

glimpse(bind_cols(example1_1, example1_2))
Rows: 44
Columns: 14
$ COWcode...1        <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70,…
$ country_name...2   <chr> "Mexico", "Mexico", "Mexico", "Mexico", "Mexico", "…
$ year...3           <dbl> 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 200…
$ v2x_polyarchy...4  <dbl> 0.684, 0.694, 0.702, 0.710, 0.710, 0.675, 0.644, 0.…
$ e_polity2...5      <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, …
$ e_fh_cl...6        <dbl> 3, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
$ e_fh_pr...7        <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, …
$ COWcode...8        <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ country_name...9   <chr> "United States of America", "United States of Ameri…
$ year...10          <dbl> 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 200…
$ v2x_polyarchy...11 <dbl> 0.824, 0.827, 0.862, 0.862, 0.857, 0.863, 0.898, 0.…
$ e_polity2...12     <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,…
$ e_fh_cl...13       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ e_fh_pr...14       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, …
When is bind_cols() useful?

bind_cols() can be useful when you combine two data frames whose rows you already know are aligned:

# Load diamonds data set
data(diamonds)

# Fit a simple regression model
fit <- lm(price ~ carat, data = diamonds)

# Make the fitted results a data frame
if (!requireNamespace("broom", quietly = TRUE)) install.packages("broom")
# The {broom} package provides tidy(), which returns the fitted
# results as a tidy data frame.
broom::tidy(fit)
# However, tidy(fit) alone does not give you confidence intervals.
# You can get them with confint_tidy(). (Note: confint_tidy() is
# deprecated in recent broom versions; tidy(fit, conf.int = TRUE)
# returns estimates and intervals in one step.)
broom::confint_tidy(fit)
# Now you know that the two data frames can be combined:
bind_cols(
  broom::tidy(fit),
  broom::confint_tidy(fit)
)

Relational Data

This subsection is borrowed from the Joins chapter of Wickham, Çetinkaya-Rundel, and Grolemund (2023) (hereafter, R4DS). Also see Social Science Computing Cooperative (2023) and Peng (2020).

It’s rare that a data analysis involves only a single table of data. Typically you have many tables of data, and you must combine them to answer the questions that you’re interested in. Collectively, multiple tables of data are called relational data because it is the relations, not just the individual data sets, that are important.1

Relations are always defined between a pair of tables. All other relations are built up from this simple idea: the relations of three or more tables are always a property of the relations between each pair. Sometimes both elements of a pair can be the same table! This is needed if, for example, you have a table of people, and each person has a reference to their parents.

To work with relational data you need verbs that work with pairs of tables. There are three families of verbs designed to work with relational data:

  • Mutating joins, which add new variables to one data frame from matching observations in another.

  • Filtering joins, which filter observations from one data frame based on whether or not they match an observation in the other table.

  • Set operations, which treat observations as if they were set elements.

The dplyr package provides a set of functions for joining two data frames into a single data frame based on a set of key columns. There are several functions in the *_join family. These functions all merge together two data frames; they differ in how they handle observations that exist in one but not both data frames. Here are the four functions from this family that you will likely use most often:

Function     What it includes in the merged data frame
left_join    All observations in the left data frame, whether or not there is a match in the right data frame
right_join   All observations in the right data frame, whether or not there is a match in the left data frame
inner_join   Only observations that appear in both data frames
full_join    All observations from both data frames

A common task in data analysis is to bring different data sets together so that we can combine columns from two (or more) tables.

This can be achieved using the join family of functions in dplyr. There are different types of joins, often represented by a series of Venn diagrams.

Mutating joins

head(example1); head(example2)

The by option is used to tell the join function which column(s) are used to “match” the rows of the two tables.

A *_join() function is used as a standalone command to create a new, joined data frame. The inputs are data frames 1 and 2 (example1 and example2). If you use left_join(), the first data frame listed is the baseline data frame, and the second is joined to it. If you use right_join(), the baseline is the second one.

The third argument, by =, is where you specify the columns in each data frame that will be used to align the rows of the two data frames. If the names of these columns differ, provide them as a named c() vector, so that rows are matched on the common values of the keys that example1 and example2 share.

If you are joining the data frames based on common values across multiple fields, list these fields within the c() vector.
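For example, with the keys used throughout this tutorial (a sketch; the full calls appear in the following sections):

# match on COWcode/ccodecow (different names) and year (same name in both)
left_join(example1, example2, by = c("COWcode" = "ccodecow", "year"))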

Inner join

An inner join is the most restrictive of the joins: it returns only rows with matches across both data frames.
This means that the number of rows in the baseline data frame may actually shrink. Adjusting which data frame is the “baseline” (written first in the function) will not change which rows are returned, but it will change the column order, row order, and which identifier columns are retained.

Animated example of an inner join (image source)

We can see that the output has fewer rows than both of these tables, which makes sense given we’re only keeping the “intersection” between the two.

# the `all()` function checks whether all values of a logical vector are TRUE
all(example1$COWcode %in% example2$ccodecow)
[1] FALSE

And the same happens for year:

all(example1$year %in% example2$year)
[1] FALSE

So, different types of join operations (represented in the figure above) will result in different outcomes. Let’s start by doing an inner join, which retains only the entries common to both tables:

example1 |> drop_na(COWcode, year) |> 
  inner_join(example2 |> drop_na(ccodecow, year),
             by = c("COWcode" = "ccodecow", "year"))
example1 |> drop_na(COWcode, year) |> 
  inner_join(example2 |> drop_na(ccodecow, year),
             by = c("COWcode" = "ccodecow", "year")) |> glimpse()
Rows: 3,450
Columns: 10
$ COWcode              <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 7…
$ country_name         <chr> "Mexico", "Mexico", "Mexico", "Mexico", "Mexico",…
$ year                 <dbl> 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2…
$ v2x_polyarchy        <dbl> 0.684, 0.694, 0.702, 0.710, 0.710, 0.675, 0.644, …
$ e_polity2            <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8…
$ e_fh_cl              <dbl> 3, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
$ e_fh_pr              <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3…
$ wdi_gdpcappppcon2017 <dbl> 17521.97, 17276.15, 17288.63, 17719.96, 17874.03,…
$ wdi_pop1564          <dbl> 61.10149, 61.50048, 61.86297, 62.20537, 62.53782,…
$ wdi_trade            <dbl> 47.16607, 46.69791, 50.20569, 53.48615, 53.93813,…
# You can see that the inner-joined data frame has fewer rows
# than either example1 or example2
glimpse(example1); glimpse(example2)
Rows: 3,922
Columns: 7
$ COWcode       <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, …
$ country_name  <chr> "Mexico", "Mexico", "Mexico", "Mexico", "Mexico", "Mexic…
$ year          <dbl> 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 20…
$ v2x_polyarchy <dbl> 0.684, 0.694, 0.702, 0.710, 0.710, 0.675, 0.644, 0.643, …
$ e_polity2     <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, NA…
$ e_fh_cl       <dbl> 3, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
$ e_fh_pr       <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
Rows: 4,220
Columns: 5
$ ccodecow             <dbl> 700, 700, 700, 700, 700, 700, 700, 700, 700, 700,…
$ year                 <dbl> 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2…
$ wdi_gdpcappppcon2017 <dbl> NA, 1189.785, 1235.810, 1200.278, 1286.794, 1315.…
$ wdi_pop1564          <dbl> 48.90547, 49.09875, 49.34294, 49.61243, 49.91287,…
$ wdi_trade            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

I used drop_na() on the keys because, by default, the *_join() functions treat NA as a regular value that matches other NAs. Thus, unless you manage the potential NAs, every row with a missing country code will match every other such row, producing spurious country-code-NA/year combinations in the final outcome.
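Here is a small illustration of that default behavior with toy data (the na_matches argument controls it; assuming a recent dplyr):

df_a <- tibble(id = c(1, NA), x = c("a", "b"))
df_b <- tibble(id = c(NA, 2), y = c("c", "d"))
left_join(df_a, df_b, by = "id")                        # NA matches NA by default
left_join(df_a, df_b, by = "id", na_matches = "never")  # NA keys never match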

Left join

Now suppose we want to create a table that combines the democracy measures from example1 with the WDI variables from example2 at the country-year level. We can use the left_join() function to merge the example1 and example2 tables.

First, when you use left_join(), you have to determine which data frame should be a baseline. If I want to use the dependent variable from example1, example1 should be the baseline:

example1 |> drop_na(COWcode, year) |> 
  left_join(example2 |> drop_na(ccodecow, year),
             by = c("COWcode" = "ccodecow", "year"))
example1 |> drop_na(COWcode, year) |> 
  left_join(example2 |> drop_na(ccodecow, year),
             by = c("COWcode" = "ccodecow", "year")) |> glimpse()
Rows: 3,818
Columns: 10
$ COWcode              <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 7…
$ country_name         <chr> "Mexico", "Mexico", "Mexico", "Mexico", "Mexico",…
$ year                 <dbl> 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2…
$ v2x_polyarchy        <dbl> 0.684, 0.694, 0.702, 0.710, 0.710, 0.675, 0.644, …
$ e_polity2            <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8…
$ e_fh_cl              <dbl> 3, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
$ e_fh_pr              <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3…
$ wdi_gdpcappppcon2017 <dbl> 17521.97, 17276.15, 17288.63, 17719.96, 17874.03,…
$ wdi_pop1564          <dbl> 61.10149, 61.50048, 61.86297, 62.20537, 62.53782,…
$ wdi_trade            <dbl> 47.16607, 46.69791, 50.20569, 53.48615, 53.93813,…

Right join

right_join() works exactly like left_join(), except that the baseline is the second data frame listed:

example1 |> drop_na(COWcode, year) |> 
  right_join(example2 |> drop_na(ccodecow, year),
             by = c("COWcode" = "ccodecow", "year"))
example1 |> drop_na(COWcode, year) |> 
  right_join(example2 |> drop_na(ccodecow, year),
             by = c("COWcode" = "ccodecow", "year")) |> glimpse()
Rows: 3,920
Columns: 10
$ COWcode              <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 7…
$ country_name         <chr> "Mexico", "Mexico", "Mexico", "Mexico", "Mexico",…
$ year                 <dbl> 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2…
$ v2x_polyarchy        <dbl> 0.684, 0.694, 0.702, 0.710, 0.710, 0.675, 0.644, …
$ e_polity2            <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8…
$ e_fh_cl              <dbl> 3, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
$ e_fh_pr              <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3…
$ wdi_gdpcappppcon2017 <dbl> 17521.97, 17276.15, 17288.63, 17719.96, 17874.03,…
$ wdi_pop1564          <dbl> 61.10149, 61.50048, 61.86297, 62.20537, 62.53782,…
$ wdi_trade            <dbl> 47.16607, 46.69791, 50.20569, 53.48615, 53.93813,…

Left join vs. Right join

Animated examples of left and right joins (image source)

Full join

A full join is the most inclusive of the joins: it returns all rows from both data frames.

If there are rows present in one data frame but not the other (where no match was found), the result includes them and becomes longer, with NA filling in any gaps created. As you join, watch the numbers of columns and rows carefully to troubleshoot case sensitivity and exact character matching.

The “baseline” data frame is the one written first in the command. Adjustment of this will not impact which records are returned by the join, but it can impact the resulting column order, row order, and which identifier columns are retained.

Animated example of a full join (image source)

example1 |> drop_na(COWcode, year) |> 
  full_join(example2 |> drop_na(ccodecow, year),
             by = c("COWcode" = "ccodecow", "year"))
example1 |> drop_na(COWcode, year) |> 
  full_join(example2 |> drop_na(ccodecow, year),
             by = c("COWcode" = "ccodecow", "year")) |> glimpse()
Rows: 4,288
Columns: 10
$ COWcode              <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 7…
$ country_name         <chr> "Mexico", "Mexico", "Mexico", "Mexico", "Mexico",…
$ year                 <dbl> 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2…
$ v2x_polyarchy        <dbl> 0.684, 0.694, 0.702, 0.710, 0.710, 0.675, 0.644, …
$ e_polity2            <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8…
$ e_fh_cl              <dbl> 3, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
$ e_fh_pr              <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3…
$ wdi_gdpcappppcon2017 <dbl> 17521.97, 17276.15, 17288.63, 17719.96, 17874.03,…
$ wdi_pop1564          <dbl> 61.10149, 61.50048, 61.86297, 62.20537, 62.53782,…
$ wdi_trade            <dbl> 47.16607, 46.69791, 50.20569, 53.48615, 53.93813,…

Filtering joins

Semi join

A semi join is a “filtering join” which uses another data set not to add rows or columns, but to perform filtering.

A semi-join keeps all observations in the baseline data frame that have a match in the secondary data frame (but does not add new columns nor duplicate any rows for multiple matches).

Animated example of a semi join (image source)
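Applied to our example data, a sketch that keeps only the example1 country-years that also appear in example2:

example1 |> drop_na(COWcode, year) |> 
  semi_join(example2 |> drop_na(ccodecow, year),
            by = c("COWcode" = "ccodecow", "year"))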

Anti join

The anti join is another “filtering join” that returns rows in the baseline data frame that do not have a match in the secondary data frame.

Common scenarios for an anti-join include identifying records not present in another data frame, troubleshooting spelling in a join (reviewing records that should have matched), and examining records that were excluded after another join.

As with right_join() and left_join(), the baseline data frame (listed first) is important. The returned rows come from the baseline data frame only. Notice in the gif below that the unmatched row in the secondary data frame (purple row 4) is not returned.

Animated example of an anti join (image source)
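Applied to our example data, a sketch that returns the example1 country-years with no WDI match, which is handy for spotting code mismatches:

example1 |> drop_na(COWcode, year) |> 
  anti_join(example2 |> drop_na(ccodecow, year),
            by = c("COWcode" = "ccodecow", "year"))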

Make Dyads using Existing Data

See details in Steven Miller’s post: here. The following explanations of directed and non-directed dyad-years come from his post.

Directed dyad-years

Directed dyad-year data are useful when the researcher is interested in, say, conflict onsets in a given dyad-year and it is important who initiates the dispute. In this interpretation, “France-Germany, 1911” and “Germany-France, 1911” are importantly different observations because Germany initiated the Agadir Crisis (MID#0315) against France and that distinction matters. There is likely an application of directed dyad-year panel frames for IPE researchers interested in, say, directional trade flows.

Here, I create directed dyad-year data by joining example1 with itself.

dyadic_df <- example1 |> drop_na(COWcode, year) |> 
  full_join(example1 |> drop_na(COWcode, year), by = "year", suffix = c("_1", "_2")) |> 
  dplyr::filter(COWcode_1 != COWcode_2)   # Removing same country pairs

Non-directed dyad-years

Non-directed dyad-year data are (I think) the most common form of dyad-year analyses of inter-state conflict. In other words, the distinction between who initiated a MID versus who was targeted in a MID is irrelevant. It does not matter that Germany initiated a MID against France in 1911 in Agadir. It only matters that there was one, especially between those two countries at that point in time.

In practice, creating non-directed dyad-year data means ccode1 is whichever state has the lower ccode in the dyad. That means the United States (ccode == 2) will always be first in any non-directed dyad in which it is a party and Samoa (ccode == 990) will never be first in any non-directed dyad in which it is a member.

Once you understand that distinction, creating a non-directed dyad-year data frame from the directed dyad-year data frame is simple.

nd_dyadic_df <- example1 |> drop_na(COWcode, year) |> 
  full_join(example1 |> drop_na(COWcode, year), by = "year", suffix = c("_1", "_2")) |> 
  dplyr::filter(COWcode_1 != COWcode_2) |>    # Removing same country pairs
  dplyr::filter(!duplicated(
    tibble(min = pmin(COWcode_1, COWcode_2),
           max = pmax(COWcode_1, COWcode_2), year)))  # Removing reverse duplicates

nd_dyadic_df2 <- example1 |> drop_na(COWcode, year) |> 
  full_join(example1 |> drop_na(COWcode, year), by = "year", suffix = c("_1", "_2")) |> 
  dplyr::filter(COWcode_1 != COWcode_2) |>    # Removing same country pairs
  dplyr::filter(COWcode_1 < COWcode_2)  # Removing reverse duplicates

Complete Combination

I’ve always found it difficult to distinguish the functions complete, expand, nesting, and crossing from one another. In a sense, they do similar things: they find combinations of values in vectors or columns. I originally thought of writing a separate tutorial for each function, but digesting them all at once makes it easier to tell them apart. Let’s take some time to look into these functions, starting with an overview of what they do:

Function             Explanation
complete             Turns implicit missing values into explicit values. It completes combinations of values from columns that exist in a data frame and/or from vectors. complete is a shortcut version of expand.
expand               Creates a new tibble with all possible combinations of values from a data frame. Often used with joins.
expand with nesting  Creates a new tibble with the unique combinations of column values that exist in a data frame.
crossing             Creates a tibble with all combinations of values from vectors.

expand, in contrast, creates a new tibble. The tibble contains either all possible combinations of values or the unique combinations that occur in the data (with nesting). Suppose you have an incomplete tibble of months and years in which the combination of the month “February” and the year 2013 is missing. You can use expand to create a complete set of years and months, including February 2013. crossing works similarly to expand, except that it uses vectors to create the combinations. However, as we will see later, expand can also do this.
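A minimal sketch of that months-and-years idea, with made-up data:

incomplete <- tibble(year = c(2012, 2013), month = c("February", "January"))
# expand() returns all four year-month combinations,
# including the missing February 2013
tidyr::expand(incomplete, year, month)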

For this tutorial, we will use a made-up data set of running events. Suppose the following data frame shows the running races a runner has completed since 2010. The minutes column shows how long it took her to complete each race. Let’s call her Anna.

running_races_anna <- tribble(
  ~year, ~race, ~minutes,
  2010, "half marathon", 110,
  2011, "marathon", 230,
  2013, "half marathon", 105,
  2016, "10km", 50,
  2018, "10km", 45,
  2018, "half marathon", 100,
  2022, "marathon", 210
)

You can see that some years are missing. Anna didn’t run a race in 2012. Also, she did not run a half marathon in 2016. In the next sections, we will try to complete this data frame with the four functions. Let’s start with complete.

Complete

Let’s assume Anna has only run 10Ks, half marathons, and marathons in recent years. Let’s further assume that she could have participated in every race each year. How many running races could she have participated in then?

A first approach is to convert the implicit combinations into explicit ones using complete. This means nothing more than creating new rows representing the races in which she did not participate. For these races, complete sets the value of the minutes column to NA:

running_races_anna |> 
  complete(year, race)

Looking at the number of rows, she could have participated in 18 competitions (6 years × 3 races). But could she? You might notice that we are missing some years: there is no data from 2012 or 2014. This is because complete only works with the values that are already in the data. Since she never participated in a race in 2012, no 2012 races appear.

However, we can add these values if we use vectors instead of columns. Suppose we want to ensure that the data frame includes all years between 2010 and 2022 and all three running events that are already present in the data:

running_races_anna |> 
  complete(year = 2010:2022, race)

All combinations of values that were already present in the data are unchanged; the code simply added rows for year-race combinations that were missing from the original data frame.

We could even go so far as to add new races to the data frame (e.g., ultra marathons):

running_races_anna |> 
  complete(year = 2010:2022, race = c(race, "ultra marathons"))

Look how we created a vector that includes the races already present in the data plus ultra marathons.

Expand

The expand function does something very similar. However, instead of adding new rows to the full data frame, it creates a new data frame containing only the columns you specify in the function (compare complete, where we kept the minutes column).

First, let’s create a simple example. Let’s create a complete combination of years and races from the existing data frame:

running_races_anna |> 
  tidyr::expand(year, race)

The result is a new data frame. You can see that the column minutes is missing. Similar to complete we can specify a vector instead of a column, for example to make sure that the data frame covers all years from 2010 to 2022:

running_races_anna |> 
  tidyr::expand(year = 2010:2022, race)

A neat trick to complete the years is the full_seq function:

running_races_anna |> 
  tidyr::expand(year = full_seq(year, 1), race)

In this case full_seq generated the complete set of years, starting with the lowest year in the data frame and ending with the highest year. The 1 indicates that the years should be incremented by 1 each time.

So we have a handle on all the combinations of years and races in our data frame. But we are missing the actual data, namely the minutes Anna took for these races. To add this data to the data frame, we combine expand with full_join:

(all_running_races_anna <- running_races_anna |> 
    tidyr::expand(year = full_seq(year, 1), race) |> 
    full_join(running_races_anna, by = c("year", "race")))

This data frame includes all 39 races that Anna could have participated in between 2010 and 2022.

You may wonder why you should use expand instead of complete at all. The result is the same as what we got with complete, and the code is more complicated.

If you take a look at the documentation, you will see that complete is actually a wrapper around expand: it is expand combined with full_join (see the official code on GitHub). Essentially, it is a shortcut for the more complicated code we just used. We will see this again in the upcoming examples.

Expand/Complete with group_by

Now let’s imagine Anna is running in a club with three other runners: Eva, John, and Leonie.

running_races_club <- tribble(
  ~year, ~runner, ~race, ~minutes,
  2012, "Eva", "half marathon", 109,
  2013, "Eva", "marathon", 260,
  2022, "Eva", "half marathon", 120,
  2018, "John", "10km", 51,
  2019, "John", "10km", 49,
  2020, "John", "10km", 50,
  2019, "Leonie", "half marathon", 45,
  2020, "Leonie", "10km", 45,
  2021, "Leonie", "half marathon", 102,
  2022, "Leonie", "marathon", 220
)

Again, you want to find all races that each runner could have participated in since joining the club. If we use the same expand technique as before, we run into problems:

(all_running_races_club <- running_races_club |> 
    tidyr::expand(year = full_seq(year, 1), race, runner))

Take John, for example:

all_running_races_club |> 
  dplyr::filter(runner %in% "John")

He joined the club in 2018. However, the data frame shows missed races going back to 2012. This is because the expanded frame covers the years of Eva, who joined in 2012.

We can fix this problem by grouping the data frame by runners.

(all_running_races_club_correct <- running_races_club |> 
    group_by(runner) |> 
    tidyr::expand(year = full_seq(year, 1), 
           race = c("10km", "half marathon", "marathon")) |> 
    ungroup())

With group_by, we expand the rows only within each runner. If you now take a look at the data, you will notice that John has no races before 2018, which is exactly what we want.

all_running_races_club_correct |> 
  dplyr::filter(runner %in% "John")

Yet, we still need the actual data of the three runners. We use left_join to add the running times to the expanded data frame:

(complete_running_races_club <- all_running_races_club_correct |> 
    left_join(running_races_club, by = c("year", "runner", "race")))

Since we already know that complete is a shortcut for such an analysis, we can use it instead:

running_races_club |> 
  group_by(runner) |> 
  complete(year = full_seq(year, 1), 
           race = c("10km", "half marathon", "marathon")) |> 
  ungroup()

With this data, we can do some interesting analyses. For example, we could visualize the percentage of competitions in which each runner actually participated.

First, we need to find out how many races each runner has completed. To do this, we count the number of races that each runner has or has not completed:

(count_races <- complete_running_races_club |> 
    count(runner, race, missed_races = is.na(minutes)))

We see that Eva has not completed a single 10km run in her years as a club member: for her 10km races, there is no row where the missed_races column is FALSE.

Fortunately, we have learned that we can complete an existing data frame with complete. Let’s do that:

count_races |> 
  complete(runner, race, missed_races, fill = list(n = 0))

The code has an interesting addition: the fill parameter. It lets us fill the newly created rows with actual values instead of NA. Since we know that a missing row means no races of that kind were recorded, we can be sure it represents zero races. For Eva, for example, a row was missing for the 10km races in which she never participated.

Now that we have the complete count data of races per runner, we can calculate the percentage of races they participated in. To calculate the percentages, we must first put the data into a wide format and then create a column that represents the percentages:

count_races |> 
  complete(runner, race, missed_races, fill = list(n = 0)) |> 
  pivot_wider(names_from = missed_races, values_from = n) |> 
  mutate(
    percent_races = (`FALSE` / (`TRUE` + `FALSE`)) * 100)

Let’s talk about Eva again. She participated in 0% of the 10k races and in 18.18% of the possible half marathons. Since she ran only 1 of 11 marathons, she participated in 9% of the marathons.

This is how it looks for all runners:

count_races |> 
  complete(runner, race, missed_races, fill = list(n = 0)) |> 
  pivot_wider(names_from = missed_races, values_from = n) |> 
  mutate(
    percent_races = (`FALSE` / (`TRUE` + `FALSE`)) * 100) |> 
  ggplot(aes(x = race, y = percent_races)) +
  scale_y_continuous(labels = scales::label_percent(scale = 1)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) +
  geom_col() +
  facet_wrap(vars(runner))

Expand with nesting

So far, we have completed data frames with missing rows. Sometimes, however, we are interested in the unique combinations of values in a data frame. Suppose your running club has 540 members and you want to know which competitions each runner has participated in during his or her time in the club. This is basically the opposite of what we just did: instead of finding all combinations of values, we are looking for the unique combinations that actually occur in a given data frame. To find them, we can combine expand with nesting:

running_races_club |> 
  tidyr::expand(nesting(runner, race))

Once again, you can see that Eva has never run a 10K, and John has never run a half marathon or marathon. But we have to infer that information from the data frame: the data shows what happened, not what didn’t happen. To find out which races the runners never participated in, we can combine the code with anti_join:

full_combinations_runners <- 
  tidyr::expand(running_races_club, runner, 
                race = c("10km", "half marathon","marathon"))

full_combinations_runners |> 
  anti_join(running_races_club, by = c("runner", "race"))

The result of our analysis is now easier to process: we no longer have to search for the known unknowns, and we get the desired results directly.

Crossing

Let’s talk about tennis. Suppose you want to create a data frame that shows all Grand Slams (Australian Open, French Open, Wimbledon, US Open) from 1905 to 2022. You don’t have an existing data frame at hand, so you need to create one from scratch.

For these cases you need crossing. The difference from the other functions is that crossing does not need an existing data frame. We use vectors instead:

crossing(
  year = 1905:2022,
  major = c("Australien Open", "French Open", "Wimbledon", "US Open")
)

This gives us a total of 472 Grand Slams. Similarly, we could create a data frame representing the World Marathon Majors, which started in 2006:

crossing(
  year = 2006:2022,
  races = c("Tokyo", "Boston", "Chicago",
            "London", "Berlin", "New York")
  )

By itself, the data only gives us a complete set of combinations; it is not very meaningful on its own. crossing is usually a starting point for further analyses. Imagine we had a data set with all the world records set at these majors. We could join the world records to this data frame to determine the percentage of races in which a world record was set at the six majors.
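Here is a sketch of that idea; the world_records rows below are hypothetical stand-ins, not real data:

majors <- crossing(
  year = 2006:2022,
  races = c("Tokyo", "Boston", "Chicago",
            "London", "Berlin", "New York")
)
world_records <- tribble(   # hypothetical record years
  ~year, ~races,
  2008, "Berlin",
  2014, "Berlin",
  2018, "Berlin"
)
majors |> 
  left_join(world_records |> mutate(record = TRUE),
            by = c("year", "races")) |> 
  summarise(percent_with_record = mean(!is.na(record)) * 100)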

How to Use complete, expand, crossing for Our Research?

Complete set of dyad-years using country codes

If you want to make all country dyads between 2000 and 2022, you can use the *_join() functions together with complete(), expand(), or crossing().

First, using codelist from the countrycode package, let us make a data frame of the possible country codes from the Correlates of War (COW) project:

country_set <- tibble(
  cown = countrycode::codelist$cown
)
head(country_set)

Some small islands and economic entities (e.g., American Samoa) are not covered by the COW project, so we eliminate the NAs from the cown column.

country_set |> drop_na(cown) -> country_set_nona

Next, using crossing(), make a complete set of COW country code between 2000 and 2022.

crossing(
  year = 2000:2022,
  cown = country_set_nona$cown
) |> arrange(cown, year) -> monadic_set

To make a full set of directed dyad-years, use the full_join() function:

dir_dyads_set <- monadic_set |> 
  full_join(monadic_set, by = "year", suffix = c("_1", "_2")) |> 
  dplyr::filter(cown_1 != cown_2) # remove rows with the same cown_1 and cown_2

head(dir_dyads_set)

From this, you can easily derive a full set of non-directed dyad-years.

nondir_dyads_set <- dir_dyads_set |> 
  dplyr::filter(cown_1 < cown_2)  # keep the lower code first, per the convention above

head(nondir_dyads_set)

References

Dahl, Robert A. 1971. Polyarchy: Participation and Opposition. New Haven, CT: Yale University Press.
Peng, Roger D. 2020. “Lecture 9: Joining Data in R: Basics.” Statistical Computing (Biostatistics 140.776). https://rdpeng.github.io/Biostat776/lecture-joining-data-in-r-basics.html.
Social Science Computing Cooperative. 2023. “Chapter 17: Merging.” Data Wrangling with R. https://sscc.wisc.edu/sscc/pubs/dwr/merge-tidy.html.
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 2nd ed. Sebastopol, CA: O’Reilly Media. https://r4ds.hadley.nz/joins.html.

Footnotes

  1. Non-relational data has no tables, rows, primary keys, or foreign keys. Instead, it uses a storage model optimized for the specific requirements of the type of data being stored. You can see an example of JSON-format data here.