How to sum a variable by group in R? [Answered]

Sample problem:

I have a data frame with two columns. First column contains categories such as “First”, “Second”, “Third”, and the second column has numbers that represent the number of times I saw the specific groups from “Category”.

For example:

Category     Frequency
First        10
First        15
First        5
Second       2
Third        14
Third        20
Second       3

I want to sort the data by Category and sum all the Frequencies:

Category     Frequency
First        30
Second       5
Third        34

How would I do this in R?

How to sum a variable by group? Answer #1:

Using aggregate:

aggregate(x$Frequency, by=list(Category=x$Category), FUN=sum)
  Category  x
1    First 30
2   Second  5
3    Third 34

In the example above, multiple dimensions can be specified in the list. Multiple aggregated metrics of the same data type can be incorporated via cbind:

aggregate(cbind(x$Frequency, x$Metric2, x$Metric3) ...

(embedding @thelatemail comment), aggregate has a formula interface too

aggregate(Frequency ~ Category, x, sum)

Or if you want to aggregate multiple columns, you could use the . notation (works for one column too)

aggregate(. ~ Category, x, sum)

or tapply:

tapply(x$Frequency, x$Category, FUN=sum)
 First Second  Third 
    30      5     34 

Using this data:

x <- data.frame(Category=factor(c("First", "First", "First", "Second",
                                      "Third", "Third", "Second")), 
                    Frequency=c(10,15,5,2,14,20,3))

Answer #2:

You can also use the dplyr package for that purpose:

library(dplyr)
x %>% 
  group_by(Category) %>% 
  summarise(Frequency = sum(Frequency))

#Source: local data frame [3 x 2]
#
#  Category Frequency
#1    First        30
#2   Second         5
#3    Third        34

Or, for multiple summary columns (works with one column too):

x %>% 
  group_by(Category) %>% 
  summarise(across(everything(), sum))

Here are some more examples of how to summarise data by group using dplyr functions using the built-in dataset mtcars:

# several summary columns with arbitrary names
mtcars %>% 
  group_by(cyl, gear) %>%                            # multiple group columns
  summarise(max_hp = max(hp), mean_mpg = mean(mpg))  # multiple summary columns

# summarise all columns except grouping columns using "sum" 
mtcars %>% 
  group_by(cyl) %>% 
  summarise(across(everything(), sum))

# summarise all columns except grouping columns using "sum" and "mean"
mtcars %>% 
  group_by(cyl) %>% 
  summarise(across(everything(), list(mean = mean, sum = sum)))

# multiple grouping columns
mtcars %>% 
  group_by(cyl, gear) %>% 
  summarise(across(everything(), list(mean = mean, sum = sum)))

# summarise specific variables, not all
mtcars %>% 
  group_by(cyl, gear) %>% 
  summarise(across(c(qsec, mpg, wt), list(mean = mean, sum = sum)))

# summarise specific variables (numeric columns except grouping columns)
mtcars %>% 
  group_by(gear) %>% 
  summarise(across(where(is.numeric), list(mean = mean, sum = sum)))

Answer #3:

The answer provided above works and is simple. However, if you are handling larger datasets and need a performance boost there is a faster alternative:

library(data.table)
data = data.table(Category=c("First","First","First","Second","Third", "Third", "Second"), 
                  Frequency=c(10,15,5,2,14,20,3))
data[, sum(Frequency), by = Category]
#    Category V1
# 1:    First 30
# 2:   Second  5
# 3:    Third 34
system.time(data[, sum(Frequency), by = Category] )
# user    system   elapsed 
# 0.008     0.001     0.009 

Let’s compare that to the same thing using data.frame and the above above:

data = data.frame(Category=c("First","First","First","Second","Third", "Third", "Second"),
                  Frequency=c(10,15,5,2,14,20,3))
system.time(aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum))
# user    system   elapsed 
# 0.008     0.000     0.015 

And if you want to keep the column this is the syntax:

data[,list(Frequency=sum(Frequency)),by=Category]
#    Category Frequency
# 1:    First        30
# 2:   Second         5
# 3:    Third        34

The difference will become more noticeable with larger datasets, as the code below demonstrates:

data = data.table(Category=rep(c("First", "Second", "Third"), 100000),
                  Frequency=rnorm(100000))
system.time( data[,sum(Frequency),by=Category] )
# user    system   elapsed 
# 0.055     0.004     0.059 
data = data.frame(Category=rep(c("First", "Second", "Third"), 100000), 
                  Frequency=rnorm(100000))
system.time( aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum) )
# user    system   elapsed 
# 0.287     0.010     0.296 

For multiple aggregations, you can combine lapply and .SD as follows

data[, lapply(.SD, sum), by = Category]
#    Category Frequency
# 1:    First        30
# 2:   Second         5
# 3:    Third        34

Answer #4:

You can also use the by() function:

x2 <- by(x$Frequency, x$Category, sum)
do.call(rbind,as.list(x2))

Those other packages (plyr, reshape) have the benefit of returning a data.frame, but it’s worth being familiar with by() since it’s a base function.

Hope you learned something from this post.

Follow Programming Articles for more!

About ᴾᴿᴼᵍʳᵃᵐᵐᵉʳ

Linux and Python enthusiast, in love with open source since 2014, Writer at programming-articles.com, India.

View all posts by ᴾᴿᴼᵍʳᵃᵐᵐᵉʳ →