### Sample problem:

I have a data frame with two columns. The first column contains categories such as “First”, “Second”, and “Third”, and the second column holds the number of times I saw each category.

For example:

```
Category Frequency
First 10
First 15
First 5
Second 2
Third 14
Third 20
Second 3
```

I want to sort the data by Category and sum all the Frequencies:

```
Category Frequency
First 30
Second 5
Third 34
```

How would I do this in R?

## How to sum a variable by group? Answer #1:

Using `aggregate`:

```
aggregate(x$Frequency, by=list(Category=x$Category), FUN=sum)
#   Category  x
# 1    First 30
# 2   Second  5
# 3    Third 34
```

In the example above, multiple dimensions can be specified in the `list`. Multiple aggregated metrics of the same data type can be incorporated via `cbind`:

```
aggregate(cbind(x$Frequency, x$Metric2, x$Metric3) ...
```
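Because the snippet above is abbreviated, here is a minimal runnable sketch of the `cbind` approach. The `Metric2` column and its values are made up purely for illustration and are not part of the original data. Naming the arguments inside `cbind` should keep readable column names in the result.

```r
# Example data; Metric2 is a hypothetical second metric added for illustration
x <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3),
  Metric2   = c(1, 2, 3, 4, 5, 6, 7)
)

# Sum both numeric columns within each Category in a single call
res <- aggregate(
  cbind(Frequency = x$Frequency, Metric2 = x$Metric2),
  by  = list(Category = x$Category),
  FUN = sum
)
print(res)
```

Without the names inside `cbind`, the aggregated columns would get generic names such as `V1` and `V2`.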

`aggregate` also has a formula interface (credit to @thelatemail's comment):

```
aggregate(Frequency ~ Category, x, sum)
```

Or, if you want to aggregate multiple columns, you could use the `.` notation (works for one column too):

```
aggregate(. ~ Category, x, sum)
```

Or use `tapply`:

```
tapply(x$Frequency, x$Category, FUN=sum)
#  First Second  Third
#     30      5     34
```

The examples above use this data:

```
x <- data.frame(Category=factor(c("First", "First", "First", "Second",
                                  "Third", "Third", "Second")),
                Frequency=c(10,15,5,2,14,20,3))
```
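Note that `tapply` returns a named array rather than a data frame. If a data frame is needed, one possible way to convert the result is sketched below; the variable names `sums` and `res` are just illustrative.

```r
# Same example data as above
x <- data.frame(
  Category  = factor(c("First", "First", "First", "Second",
                       "Third", "Third", "Second")),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# tapply returns a one-dimensional array named by the group levels
sums <- tapply(x$Frequency, x$Category, FUN = sum)

# Rebuild a two-column data frame from the names and the values
res <- data.frame(Category = names(sums), Frequency = as.vector(sums))
print(res)
```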

## Answer #2:

You can also use the **dplyr** package for that purpose:

```
library(dplyr)
x %>%
  group_by(Category) %>%
  summarise(Frequency = sum(Frequency))
#Source: local data frame [3 x 2]
#
# Category Frequency
#1 First 30
#2 Second 5
#3 Third 34
```

Or, for **multiple summary columns** (works with one column too):

```
x %>%
  group_by(Category) %>%
  summarise(across(everything(), sum))
```

Here are some more examples of summarising data by group with dplyr, using the built-in dataset `mtcars`:

```
# several summary columns with arbitrary names
mtcars %>%
  group_by(cyl, gear) %>%                            # multiple group columns
  summarise(max_hp = max(hp), mean_mpg = mean(mpg))  # multiple summary columns

# summarise all columns except grouping columns using "sum"
mtcars %>%
  group_by(cyl) %>%
  summarise(across(everything(), sum))

# summarise all columns except grouping columns using "sum" and "mean"
mtcars %>%
  group_by(cyl) %>%
  summarise(across(everything(), list(mean = mean, sum = sum)))

# multiple grouping columns
mtcars %>%
  group_by(cyl, gear) %>%
  summarise(across(everything(), list(mean = mean, sum = sum)))

# summarise specific variables, not all
mtcars %>%
  group_by(cyl, gear) %>%
  summarise(across(c(qsec, mpg, wt), list(mean = mean, sum = sum)))

# summarise specific variables (numeric columns except grouping columns)
mtcars %>%
  group_by(gear) %>%
  summarise(across(where(is.numeric), list(mean = mean, sum = sum)))
```
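With more than one grouping column, `summarise()` by default returns a result that is still grouped by all but the last grouping column, and it prints a message saying so. The sketch below assumes dplyr 1.0.0 or later (the same version that provides `across()`), where the `.groups` argument can be used to drop the grouping explicitly.

```r
library(dplyr)

# Summarise selected columns by two grouping columns and return an
# ungrouped result; .groups = "drop" removes the leftover grouping
res <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(across(c(qsec, mpg, wt), list(mean = mean, sum = sum)),
            .groups = "drop")

print(res)
```

An ungrouped result is often easier to work with if later steps should not silently operate per group.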

## Answer #3:

The `aggregate` answer above works and is simple. However, if you are handling larger datasets and need a performance boost, data.table offers a faster alternative:

```
library(data.table)
data = data.table(Category=c("First","First","First","Second","Third", "Third", "Second"),
Frequency=c(10,15,5,2,14,20,3))
data[, sum(Frequency), by = Category]
# Category V1
# 1: First 30
# 2: Second 5
# 3: Third 34
system.time(data[, sum(Frequency), by = Category] )
# user system elapsed
# 0.008 0.001 0.009
```

Let’s compare that to the same operation on a data.frame using `aggregate`:

```
data = data.frame(Category=c("First","First","First","Second","Third", "Third", "Second"),
Frequency=c(10,15,5,2,14,20,3))
system.time(aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum))
# user system elapsed
# 0.008 0.000 0.015
```

And if you want to keep the column name, this is the syntax (with `data` as a data.table again):

```
data[,list(Frequency=sum(Frequency)),by=Category]
# Category Frequency
# 1: First 30
# 2: Second 5
# 3: Third 34
```

The difference will become more noticeable with larger datasets, as the code below demonstrates:

```
# 300,000 rows: one Frequency value per Category entry
data = data.table(Category=rep(c("First", "Second", "Third"), 100000),
                  Frequency=rnorm(300000))
system.time( data[,sum(Frequency),by=Category] )
# user system elapsed
# 0.055 0.004 0.059
data = data.frame(Category=rep(c("First", "Second", "Third"), 100000),
                  Frequency=rnorm(300000))
system.time( aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum) )
# user system elapsed
# 0.287 0.010 0.296
```

For multiple aggregations, you can combine `lapply` and `.SD` as follows (shown on the small example data.table):

```
data[, lapply(.SD, sum), by = Category]
# Category Frequency
# 1: First 30
# 2: Second 5
# 3: Third 34
```
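When the table has more columns than you want to aggregate, `.SDcols` can restrict which columns `.SD` contains. The following is a minimal sketch; the `Metric2` and `Note` columns are hypothetical and only added for illustration.

```r
library(data.table)

# Example data; Metric2 and Note are made-up columns for illustration
dt <- data.table(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3),
  Metric2   = c(1, 2, 3, 4, 5, 6, 7),
  Note      = letters[1:7]
)

# Sum only the listed columns within each Category;
# the character column Note is left out via .SDcols
res <- dt[, lapply(.SD, sum), by = Category,
          .SDcols = c("Frequency", "Metric2")]
print(res)
```

Without `.SDcols`, `sum` would also be applied to `Note` and fail, because it is a character column.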

## Answer #4:

You can also use the **by()** function:

```
x2 <- by(x$Frequency, x$Category, sum)
do.call(rbind,as.list(x2))
```

Other packages (such as plyr and reshape) have the benefit of returning a data.frame, but it’s worth being familiar with `by()` since it’s a base function.
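If a data frame is still what you need, the matrix produced by the `do.call(rbind, ...)` step can be converted by hand. This is only one possible sketch, reusing the example data from Answer #1; the names `m` and `res` are illustrative.

```r
# Same example data as in Answer #1
x <- data.frame(
  Category  = factor(c("First", "First", "First", "Second",
                       "Third", "Third", "Second")),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# Grouped sums via by(), bound into a one-column matrix
# whose row names are the category labels
x2 <- by(x$Frequency, x$Category, sum)
m  <- do.call(rbind, as.list(x2))

# Move the row names into a proper column
res <- data.frame(Category  = rownames(m),
                  Frequency = unname(m[, 1]),
                  row.names = NULL)
print(res)
```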
