Problem:
Given two data frames:
df1 = data.frame(CustomerId = c(1:6), Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 = data.frame(CustomerId = c(2, 4, 6), State = c(rep("Alabama", 2), rep("Ohio", 1)))
df1
# CustomerId Product
# 1 Toaster
# 2 Toaster
# 3 Toaster
# 4 Radio
# 5 Radio
# 6 Radio
df2
# CustomerId State
# 2 Alabama
# 4 Alabama
# 6 Ohio
How can I do database-style (i.e., SQL-style) joins? That is, how do I get:
- An inner join of df1 and df2: return only the rows in which the left table has matching keys in the right table.
- An outer join of df1 and df2: return all rows from both tables, joining records from the left that have matching keys in the right table.
- A left outer join (or simply left join) of df1 and df2: return all rows from the left table, and any rows with matching keys from the right table.
- A right outer join of df1 and df2: return all rows from the right table, and any rows with matching keys from the left table.
Extra credit:
How can I do a SQL style select statement?
How to join (merge) data frames (inner, outer, left, right) in R: Answer #1:
By using the merge function and its optional parameters:
Inner join: merge(df1, df2) will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by = "CustomerId") to make sure that you were matching on only the fields you desired. You can also use the by.x and by.y parameters if the matching variables have different names in the different data frames.
Outer join: merge(x = df1, y = df2, by = "CustomerId", all = TRUE)
Left outer: merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)
Right outer: merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)
Cross join: merge(x = df1, y = df2, by = NULL)
Just as with the inner join, you would probably want to explicitly pass “CustomerId” to R as the matching variable. I think it’s almost always best to explicitly state the identifiers on which you want to merge; it’s safer if the input data.frames change unexpectedly and easier to read later on.
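As a rough check of what each call above returns, the following sketch runs them on the question's df1 and df2 and counts the resulting rows. The result names (inner, outer, left, right, cross) are only illustrative.

```r
# Data frames from the question
df1 <- data.frame(CustomerId = c(1:6),
                  Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 <- data.frame(CustomerId = c(2, 4, 6),
                  State = c(rep("Alabama", 2), rep("Ohio", 1)))

inner <- merge(df1, df2, by = "CustomerId")                # keys present in both
outer <- merge(df1, df2, by = "CustomerId", all = TRUE)    # keys present in either
left  <- merge(df1, df2, by = "CustomerId", all.x = TRUE)  # every row of df1
right <- merge(df1, df2, by = "CustomerId", all.y = TRUE)  # every row of df2
cross <- merge(df1, df2, by = NULL)                        # every pairing of rows

# Row counts per join type
sapply(list(inner = inner, outer = outer, left = left,
            right = right, cross = cross), nrow)
# inner outer  left right cross
#     3     6     6     3    18
```

Rows of df1 without a match in df2 get NA in the State column of the left and outer results.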
You can merge on multiple columns by giving by a vector, e.g., by = c("CustomerId", "OrderId").
If the column names to merge on are not the same, you can specify, e.g., by.x = "CustomerId_in_df1", by.y = "CustomerId_in_df2" where CustomerId_in_df1 is the name of the column in the first data frame and CustomerId_in_df2 is the name of the column in the second data frame. (These can also be vectors if you need to merge on multiple columns.)
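A minimal sketch of the by.x / by.y case, assuming two small made-up tables (orders and states are hypothetical names, and so is their content) whose key columns are named differently:

```r
# Hypothetical example data: the key column has a different name in each table
orders <- data.frame(CustomerId_in_df1 = c(1, 2, 3),
                     Product = c("Toaster", "Radio", "Kettle"),
                     stringsAsFactors = FALSE)
states <- data.frame(CustomerId_in_df2 = c(2, 3, 4),
                     State = c("Alabama", "Ohio", "Texas"),
                     stringsAsFactors = FALSE)

# Inner join on differently named keys
inner <- merge(orders, states,
               by.x = "CustomerId_in_df1", by.y = "CustomerId_in_df2")

# Full outer join on the same keys
full <- merge(orders, states,
              by.x = "CustomerId_in_df1", by.y = "CustomerId_in_df2",
              all = TRUE)

names(inner)  # the key column keeps the name given in by.x
```

In the merged result the key column carries the name from the first data frame, so later code should refer to CustomerId_in_df1.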
How to join (merge) data frames (inner, outer, left, right) in R: Answer #2:
I would recommend checking out Gabor Grothendieck’s sqldf package, which allows you to express these operations in SQL.
library(sqldf)
## inner join
df3 <- sqldf("SELECT CustomerId, Product, State
FROM df1
JOIN df2 USING(CustomerId)")
## left join (substitute 'right' for right join)
df4 <- sqldf("SELECT CustomerId, Product, State
FROM df1
LEFT JOIN df2 USING(CustomerId)")
I find the SQL syntax to be simpler and more natural than its R equivalent (but this may just reflect my RDBMS bias).
How to join (merge) data frames (inner, outer, left, right) in R: Answer #3:
There is the data.table approach for an inner join, which is very time and memory efficient (and necessary for some larger data.frames):
library(data.table)
dt1 <- data.table(df1, key = "CustomerId")
dt2 <- data.table(df2, key = "CustomerId")
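# dt1[dt2] on its own keeps every row of dt2 (a right outer join);
# nomatch = NULL drops the unmatched rows, which makes it an inner join
joined.dt1.dt.2 <- dt1[dt2, nomatch = NULL]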
merge also works on data.tables (as it is generic and calls merge.data.table):
merge(dt1, dt2)
Yet another option is the join function found in the plyr package:
library(plyr)
join(df1, df2,
type = "inner")
# CustomerId Product State
# 1 2 Toaster Alabama
# 2 4 Radio Alabama
# 3 6 Radio Ohio
Options for type: inner, left, right, full.
From ?join: Unlike merge, [join] preserves the order of x no matter what join type is used.
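To illustrate the ordering difference without plyr, here is a small base R sketch. It assumes made-up tables x and y and a helper column row_id (a hypothetical name) that records the original row order so it can be restored after merge:

```r
# Hypothetical data: x is deliberately not sorted by key
x <- data.frame(CustomerId = c(3, 1, 2),
                Product = c("Radio", "Toaster", "Kettle"),
                stringsAsFactors = FALSE)
y <- data.frame(CustomerId = c(1, 2, 3),
                State = c("Alabama", "Ohio", "Texas"),
                stringsAsFactors = FALSE)

# merge() sorts the result on the key columns by default
sorted <- merge(x, y, by = "CustomerId")
sorted$CustomerId
# [1] 1 2 3

# Record the original position, merge, then restore that order
x$row_id <- seq_len(nrow(x))
kept <- merge(x, y, by = "CustomerId", all.x = TRUE)
kept <- kept[order(kept$row_id), ]
kept$row_id <- NULL
rownames(kept) <- NULL
kept$CustomerId
# [1] 3 1 2
```

This is only one way to keep the order of x in base R; plyr's join does it without the extra column.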
How to join (merge) data frames (inner, outer, left, right) in R: Answer #4:
There are some good examples of doing this over at the R Wiki. I’ll steal a couple here:
Merge Method
Since your keys are named the same, the short way to do an inner join is merge():
merge(df1,df2)
A full outer join (all records from both tables) can be created with the “all” keyword:
merge(df1,df2, all=TRUE)
A left outer join of df1 and df2:
merge(df1,df2, all.x=TRUE)
A right outer join of df1 and df2:
merge(df1,df2, all.y=TRUE)
You can flip ’em, slap ’em and rub ’em down to get the other two outer joins you asked about 🙂
Subscript Method
A left outer join with df1 on the left using a subscript method would be:
df1[,"State"] <- df2[match(df1[ ,"CustomerId"], df2[ ,"CustomerId"]), "State"]
The other combinations of outer joins can be created by munging the left outer join subscript example. (Yeah, I know that’s the equivalent of saying “I’ll leave it as an exercise for the reader…”)
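One possible way to work that exercise, sketched in base R with match(): the index vector idx (a name introduced here for illustration) gives, for each row of df1, the position of its key in df2, or NA when there is none. A left join and an inner join both follow from it.

```r
# Data frames from the question
df1 <- data.frame(CustomerId = 1:6,
                  Product = c(rep("Toaster", 3), rep("Radio", 3)),
                  stringsAsFactors = FALSE)
df2 <- data.frame(CustomerId = c(2, 4, 6),
                  State = c("Alabama", "Alabama", "Ohio"),
                  stringsAsFactors = FALSE)

# Position in df2 of each df1 key (NA where df2 has no such key)
idx <- match(df1$CustomerId, df2$CustomerId)

# Left outer join: keep every df1 row, look up State by position
left <- df1
left$State <- df2$State[idx]

# Inner join: keep only the df1 rows that found a match
inner <- left[!is.na(idx), ]
```

Because match() returns only the first hit, this sketch assumes the keys in df2 are unique.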
How to join (merge) data frames (inner, outer, left, right) in R: Answer #5:
Update on data.table methods for joining datasets. See the examples below for each type of join. There are two methods: one uses [.data.table, passing the second data.table as the first argument to the subset operator; the other uses the merge function, which dispatches to the fast data.table method.
df1 = data.frame(CustomerId = c(1:6), Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 = data.frame(CustomerId = c(2L, 4L, 7L), State = c(rep("Alabama", 2), rep("Ohio", 1))) # one value changed to show full outer join
library(data.table)
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
setkey(dt1, CustomerId)
setkey(dt2, CustomerId)
# right outer join keyed data.tables
dt1[dt2]
setkey(dt1, NULL)
setkey(dt2, NULL)
# right outer join unkeyed data.tables - use `on` argument
dt1[dt2, on = "CustomerId"]
# left outer join - swap dt1 with dt2
dt2[dt1, on = "CustomerId"]
# inner join - use `nomatch` argument
dt1[dt2, nomatch=NULL, on = "CustomerId"]
# anti join - use `!` operator
dt1[!dt2, on = "CustomerId"]
# inner join - using merge method
merge(dt1, dt2, by = "CustomerId")
# full outer join
merge(dt1, dt2, by = "CustomerId", all = TRUE)
# see ?merge.data.table arguments for other cases
The benchmark below tests base R, sqldf, dplyr and data.table.
The benchmark uses unkeyed/unindexed datasets. Each dataset has 50M-1 rows, and there are 50M-2 common values in the join column, so each scenario (inner, left, right, full) can be tested and the join is still not trivial to perform. This is the type of join that stresses join algorithms well. Timings are as of sqldf:0.4.11, dplyr:0.7.8, data.table:1.12.0.
# inner
Unit: seconds
expr min lq mean median uq max neval
base 111.66266 111.66266 111.66266 111.66266 111.66266 111.66266 1
sqldf 624.88388 624.88388 624.88388 624.88388 624.88388 624.88388 1
dplyr 51.91233 51.91233 51.91233 51.91233 51.91233 51.91233 1
DT 10.40552 10.40552 10.40552 10.40552 10.40552 10.40552 1
# left
Unit: seconds
expr min lq mean median uq max
base 142.782030 142.782030 142.782030 142.782030 142.782030 142.782030
sqldf 613.917109 613.917109 613.917109 613.917109 613.917109 613.917109
dplyr 49.711912 49.711912 49.711912 49.711912 49.711912 49.711912
DT 9.674348 9.674348 9.674348 9.674348 9.674348 9.674348
# right
Unit: seconds
expr min lq mean median uq max
base 122.366301 122.366301 122.366301 122.366301 122.366301 122.366301
sqldf 611.119157 611.119157 611.119157 611.119157 611.119157 611.119157
dplyr 50.384841 50.384841 50.384841 50.384841 50.384841 50.384841
DT 9.899145 9.899145 9.899145 9.899145 9.899145 9.899145
# full
Unit: seconds
expr min lq mean median uq max neval
base 141.79464 141.79464 141.79464 141.79464 141.79464 141.79464 1
dplyr 94.66436 94.66436 94.66436 94.66436 94.66436 94.66436 1
DT 21.62573 21.62573 21.62573 21.62573 21.62573 21.62573 1
Be aware that there are other types of joins you can perform using data.table:
– update on join – if you want to look up values from another table and add them to your main table
– aggregate on join – if you want to aggregate on the key you are joining on, you do not have to materialize all join results
– overlapping join – if you want to merge by ranges
– rolling join – if you want the merge to be able to match values from preceding/following rows by rolling them forward or backward
– non-equi join – if your join condition is non-equal
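As an illustration of one item from the list above, here is a minimal sketch of an aggregate on join using by = .EACHI. It assumes the data.table package is installed; the tables sales_dt and cust_dt and the Amount column are made up for this example.

```r
library(data.table)

# Hypothetical data: several sales rows per customer
sales_dt <- data.table(CustomerId = c(1, 1, 2, 3),
                       Amount = c(10, 20, 5, 7))
cust_dt <- data.table(CustomerId = c(1, 2, 4))

# For each row of cust_dt, sum the matching sales rows directly,
# without first building the full joined table
totals <- sales_dt[cust_dt, on = "CustomerId",
                   .(Total = sum(Amount)), by = .EACHI]
totals
```

Customer 4 has no sales rows, so its Total comes back as NA under the default nomatch behaviour.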
Code to reproduce:
library(microbenchmark)
library(sqldf)
library(dplyr)
library(data.table)
sapply(c("sqldf","dplyr","data.table"), packageVersion, simplify=FALSE)
n = 5e7
set.seed(108)
df1 = data.frame(x=sample(n,n-1L), y1=rnorm(n-1L))
df2 = data.frame(x=sample(n,n-1L), y2=rnorm(n-1L))
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
mb = list()
# inner join
microbenchmark(times = 1L,
base = merge(df1, df2, by = "x"),
sqldf = sqldf("SELECT * FROM df1 INNER JOIN df2 ON df1.x = df2.x"),
dplyr = inner_join(df1, df2, by = "x"),
DT = dt1[dt2, nomatch=NULL, on = "x"]) -> mb$inner
# left outer join
microbenchmark(times = 1L,
base = merge(df1, df2, by = "x", all.x = TRUE),
sqldf = sqldf("SELECT * FROM df1 LEFT OUTER JOIN df2 ON df1.x = df2.x"),
dplyr = left_join(df1, df2, by = c("x"="x")),
DT = dt2[dt1, on = "x"]) -> mb$left
# right outer join
microbenchmark(times = 1L,
base = merge(df1, df2, by = "x", all.y = TRUE),
sqldf = sqldf("SELECT * FROM df2 LEFT OUTER JOIN df1 ON df2.x = df1.x"),
dplyr = right_join(df1, df2, by = "x"),
DT = dt1[dt2, on = "x"]) -> mb$right
# full outer join
microbenchmark(times = 1L,
base = merge(df1, df2, by = "x", all = TRUE),
dplyr = full_join(df1, df2, by = "x"),
DT = merge(dt1, dt2, by = "x", all = TRUE)) -> mb$full
lapply(mb, print) -> nul
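The data setup above can be sanity-checked at a much smaller scale. The sketch below uses n = 1000 instead of 5e7 (an arbitrary choice for illustration) and base R only; small1, small2 and common are names introduced here. Each sample leaves out one value, so the two key columns share n-2 values unless both happen to leave out the same one.

```r
n <- 1000L
set.seed(108)
small1 <- data.frame(x = sample(n, n - 1L), y1 = rnorm(n - 1L))
small2 <- data.frame(x = sample(n, n - 1L), y2 = rnorm(n - 1L))

# Number of key values present in both tables
common <- length(intersect(small1$x, small2$x))

# Row counts of each join type follow from that number
inner_n <- nrow(merge(small1, small2, by = "x"))                # common
left_n  <- nrow(merge(small1, small2, by = "x", all.x = TRUE))  # n - 1
right_n <- nrow(merge(small1, small2, by = "x", all.y = TRUE))  # n - 1
full_n  <- nrow(merge(small1, small2, by = "x", all = TRUE))    # 2 * (n - 1) - common
```

Since the four join types return different row counts, each scenario exercises a different result.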
How to join (merge) data frames (inner, outer, left, right) in R: Answer #6:
Update join. One other important SQL-style join is an “update join” where columns in one table are updated (or created) using another table.
Modifying the OP’s example tables…
sales = data.frame(
CustomerId = c(1, 1, 1, 3, 4, 6),
Year = 2000:2005,
Product = c(rep("Toaster", 3), rep("Radio", 3))
)
cust = data.frame(
CustomerId = c(1, 1, 4, 6),
Year = c(2001L, 2002L, 2002L, 2002L),
State = state.name[1:4]
)
sales
# CustomerId Year Product
# 1 2000 Toaster
# 1 2001 Toaster
# 1 2002 Toaster
# 3 2003 Radio
# 4 2004 Radio
# 6 2005 Radio
cust
# CustomerId Year State
# 1 2001 Alabama
# 1 2002 Alaska
# 4 2002 Arizona
# 6 2002 Arkansas
Suppose we want to add the customer’s state from cust to the purchases table, sales, ignoring the year column. With base R, we can identify matching rows and then copy values over:
sales$State <- cust$State[ match(sales$CustomerId, cust$CustomerId) ]
# CustomerId Year Product State
# 1 2000 Toaster Alabama
# 1 2001 Toaster Alabama
# 1 2002 Toaster Alabama
# 3 2003 Radio <NA>
# 4 2004 Radio Arizona
# 6 2005 Radio Arkansas
# cleanup for the next example
sales$State <- NULL
As can be seen here, match selects the first matching row from the customer table.
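A small sketch of that first-match behaviour, using a copy of the customer table under the hypothetical name cust_df. It also shows how anyDuplicated can flag repeated keys before relying on match:

```r
cust_df <- data.frame(CustomerId = c(1, 1, 4, 6),
                      Year = c(2001L, 2002L, 2002L, 2002L),
                      State = state.name[1:4],
                      stringsAsFactors = FALSE)

# match() reports only the first row whose key equals 1
first_hit <- match(1, cust_df$CustomerId)
first_hit
# [1] 1

# which() shows that there are in fact two candidate rows
all_hits <- which(cust_df$CustomerId == 1)
all_hits
# [1] 1 2

# A non-zero result means the key column has duplicates
# (it is the index of the first repeated value)
dup_at <- anyDuplicated(cust_df$CustomerId)
dup_at
# [1] 2
```

If anyDuplicated returns 0, the lookup keys are unique and the match-based update is unambiguous.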
Update join with multiple columns. The approach above works well when we are joining on only a single column and are satisfied with the first match. Suppose we want the year of measurement in the customer table to match the year of sale.
As @bgoldst’s answer mentions, match with interaction might be an option for this case. More straightforwardly, one could use data.table:
library(data.table)
setDT(sales); setDT(cust)
sales[, State := cust[sales, on=.(CustomerId, Year), x.State]]
# CustomerId Year Product State
# 1: 1 2000 Toaster <NA>
# 2: 1 2001 Toaster Alabama
# 3: 1 2002 Toaster Alaska
# 4: 3 2003 Radio <NA>
# 5: 4 2004 Radio <NA>
# 6: 6 2005 Radio <NA>
# cleanup for next example
sales[, State := NULL]
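For comparison, the base R route mentioned above (match with interaction) might look like the sketch below. It works on plain data.frame copies named sales_df and cust_df (hypothetical names, chosen so the data.tables above are left untouched) and builds a combined key from both columns.

```r
sales_df <- data.frame(CustomerId = c(1, 1, 1, 3, 4, 6),
                       Year = 2000:2005,
                       Product = c(rep("Toaster", 3), rep("Radio", 3)),
                       stringsAsFactors = FALSE)
cust_df <- data.frame(CustomerId = c(1, 1, 4, 6),
                      Year = c(2001L, 2002L, 2002L, 2002L),
                      State = state.name[1:4],
                      stringsAsFactors = FALSE)

# Combine the two key columns into a single key per row
sales_key <- interaction(sales_df$CustomerId, sales_df$Year)
cust_key  <- interaction(cust_df$CustomerId, cust_df$Year)

# Look up State only where both CustomerId and Year agree
sales_df$State <- cust_df$State[match(sales_key, cust_key)]
sales_df$State
# [1] NA        "Alabama" "Alaska"  NA        NA        NA
```

The result agrees with the data.table output above: only the 2001 and 2002 sales of customer 1 find a row with the same year.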
Rolling update join. Alternately, we may want to take the last state the customer was found in:
sales[, State := cust[sales, on=.(CustomerId, Year), roll=TRUE, x.State]]
# CustomerId Year Product State
# 1: 1 2000 Toaster <NA>
# 2: 1 2001 Toaster Alabama
# 3: 1 2002 Toaster Alaska
# 4: 3 2003 Radio <NA>
# 5: 4 2004 Radio Arizona
# 6: 6 2005 Radio Arkansas
The three examples above all focus on creating/adding a new column.