Confusion matrix in R

In this article, we will discuss what a confusion matrix is and how to calculate one in R.

What is a Confusion Matrix?

In simple words, a confusion matrix is a technique for summarizing the performance of a classification algorithm.

The numbers of correct and incorrect predictions are summarized as counts and broken down by class. This is the key to the confusion matrix.

The confusion matrix shows the ways in which your classification model is confused when it makes predictions.

It gives you insight not only into the errors being made by your classifier but more importantly the types of errors that are being made.

It is this breakdown that overcomes the limitation of using classification accuracy alone.

Confusion matrix explained:

A confusion matrix, or error matrix, addresses the standard classification problem in statistics. It is a specific table layout that helps data analysts visualize how an algorithm performs. It applies in particular to supervised learning algorithms.

To elaborate further, a confusion matrix follows an N x N format, where N is the number of target classes. You can use this table or matrix to evaluate a classification model’s performance. This is possible because the matrix compares the predicted values with the target values.

In a nutshell, you can describe how your machine learning model, a classifier in this case, performs on a set of test data (for which you already have the true values).

To understand this method, you need to be familiar with the following terms:

  • True Positive (TP): Positive values correctly predicted as positive
  • False Positive (FP): Negative values incorrectly predicted as positive
  • False Negative (FN): Positive values incorrectly predicted as negative
  • True Negative (TN): Negative values correctly predicted as negative
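The four terms above can be counted directly from a pair of label vectors. The following is a minimal base R sketch; the vectors actual and predicted and the labels "pos" and "neg" are made-up names and data used only for illustration.

```r
# Hypothetical true labels and model predictions (illustrative data only)
actual    <- c("pos", "pos", "neg", "neg", "pos", "neg", "neg", "pos")
predicted <- c("pos", "neg", "neg", "pos", "pos", "neg", "neg", "pos")

tp <- sum(predicted == "pos" & actual == "pos")  # positives predicted as positive
fp <- sum(predicted == "pos" & actual == "neg")  # negatives predicted as positive
fn <- sum(predicted == "neg" & actual == "pos")  # positives predicted as negative
tn <- sum(predicted == "neg" & actual == "neg")  # negatives predicted as negative

print(c(TP = tp, FP = fp, FN = fn, TN = tn))
```

The four counts always add up to the number of predictions, which is a quick sanity check when you compute them by hand.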

Let us look at some examples to gain more clarity.

Confusion Matrix Examples

  • True Positive

You predicted that India would win the Cricket World Cup, and it won.

  • False Positive

You predicted that India would win, but it lost.

  • False Negative

You predicted that India would not win, but it won.

  • True Negative

You predicted that India would not win the Cricket World Cup, and it did not.

As we move further, remember that every prediction is described by two words: Positive or Negative (the class that was predicted) and True or False (whether that prediction was correct).

How to Calculate a Confusion Matrix

Below is the process for calculating a confusion matrix.

  1. You need a test dataset or a validation dataset with expected outcome values.
  2. Make a prediction for each row in your test dataset.
  3. From the expected outcomes and predictions count:
    1. The number of correct predictions for each class.
    2. The number of incorrect predictions for each class, organized by the class that was predicted.

These numbers are then organized into a table, or a matrix as follows:

  • Expected down the side: Each row of the matrix corresponds to an expected (actual) class.
  • Predicted across the top: Each column of the matrix corresponds to a predicted class.

The counts of correct and incorrect classification are then filled into the table.

The total number of correct predictions for a class goes into the expected row for that class value and the predicted column for that same class value.

In the same way, the total number of incorrect predictions for a class goes into the expected row for that class value and the predicted column for the class that was wrongly predicted.
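The counting and table-filling steps above can be sketched in base R with table(). This is only an illustration: the class names "cat" and "dog" and both vectors are invented, and the layout follows the convention described here (expected classes in rows, predicted classes in columns).

```r
# Hypothetical expected outcomes and predictions for a two-class problem
classes   <- c("cat", "dog")
expected  <- factor(c("cat", "cat", "dog", "dog", "dog", "cat"), levels = classes)
predicted <- factor(c("cat", "dog", "dog", "dog", "cat", "cat"), levels = classes)

# Rows are expected classes, columns are predicted classes
cm <- table(Expected = expected, Predicted = predicted)
print(cm)

correct_per_class   <- diag(cm)                # correct predictions sit on the diagonal
incorrect_per_class <- rowSums(cm) - diag(cm)  # errors for each expected class
print(correct_per_class)
print(incorrect_per_class)
```

Fixing the factor levels up front keeps the table square even if one class never appears among the predictions.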

In practice, a binary classifier such as this one can make two types of errors: it can incorrectly assign an individual who defaults to the no default category, or it can incorrectly assign an individual who does not default to the default category. It is often of interest to determine which of these two types of errors are being made. A confusion matrix […] is a convenient way to display this information.

An Introduction to Statistical Learning: with Applications in R, 2014

How to Calculate the Confusion Matrix in R?

Before we move on to the technicalities, let us first understand why we have chosen R for this purpose. It is because of the following benefits that this programming language is gaining popularity among statisticians and data scientists worldwide: 

  • Reproducible: With R, you can reproduce reports and write reusable code 
  • Shareable: It has a gentle learning curve, which opens up avenues for collaboration
  • Repeatable: Anyone can not only understand what you did but also repeat the steps to create the same functions on their machines

Consider a scenario where you have a list of expected or known values and another list of predictions from your machine learning model. In R, you can calculate the confusion matrix using a simple function from the caret library: confusionMatrix(). It can not only calculate the matrix but also return a detailed report for the results.

You can follow the steps below to practice the process:

  • Take a test dataset with known expected outcomes.
  • Make a prediction for each row of your test dataset.
  • Determine the total counts of correct and incorrect predictions for each class.

Once you have done this, you will find the numbers organized in the following fashion:

  • In the output of confusionMatrix(), every row of the matrix corresponds to a predicted class and every column to an actual (reference) class.
  • The total numbers of correct and incorrect classifications are reflected in the table, along with the sums for each class.

Suppose you have 10 persons divided into two classes, male and female. You have to arrange the information as a confusion matrix when you know that 2 men were classified as women, while 1 woman was classified as a man. In the table below, rows are the actual classes and columns are the predicted classes.

                     Predicted
                     women     men
Actual    women      3         1
          men        2         4

Here, the correct values lie on the diagonal from the top-left to the bottom-right of the matrix (3 + 4). The results tell us that there are more errors from predicting men as women than from predicting women as men. The algorithm made 7 correct predictions out of 10, which means it has 70% accuracy.
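The 10-person example can be reproduced in base R as a quick check. This sketch rebuilds label vectors that match the counts given above (the vectors themselves are reconstructed for illustration) and puts the actual classes in rows and the predicted classes in columns.

```r
classes <- c("women", "men")

# 4 actual women (3 predicted as women, 1 as a man)
# 6 actual men   (2 predicted as women, 4 as men)
actual    <- factor(c(rep("women", 4), rep("men", 6)), levels = classes)
predicted <- factor(c(rep("women", 3), "men",
                      rep("women", 2), rep("men", 4)), levels = classes)

cm <- table(Actual = actual, Predicted = predicted)
print(cm)

# Accuracy is the sum of the diagonal divided by the total count
accuracy <- sum(diag(cm)) / sum(cm)
print(accuracy)
```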

Guide to Making and Calculating a Confusion Matrix in R

As you can observe, the confusion matrix function is a useful tool for examining the possible outcomes of your predictions. So, before you begin creating your matrix, you first need to have a “cut” of your probability values. In other words, you need to mark a threshold to turn your probabilities into class predictions.

To do this, you can use the ifelse() function. For example:

class_prediction <-
  ifelse(probability_prediction > 0.50,
         "positive_class",
         "negative_class"
  )

You can also use the table() function to make a contingency table in base R. However, the confusionMatrix() function additionally reports useful ancillary statistics such as accuracy, sensitivity, and specificity.
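As a rough illustration of the thresholding step combined with base R's table(), the sketch below uses invented probabilities and true labels; probability_prediction and actual are hypothetical stand-ins for your model's output and your test labels.

```r
# Hypothetical predicted probabilities and true classes
probability_prediction <- c(0.91, 0.42, 0.67, 0.08, 0.55, 0.30)
actual <- c("positive_class", "negative_class", "positive_class",
            "negative_class", "negative_class", "positive_class")

# Apply a 0.50 cutoff to turn probabilities into class predictions
class_prediction <- ifelse(probability_prediction > 0.50,
                           "positive_class", "negative_class")

# Contingency table: predictions in rows, true classes in columns
cm <- table(Predicted = class_prediction, Actual = actual)
print(cm)
```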

The next step is to calculate the confusion matrix and other associated stats. Here, you would need the predicted and actual outcomes. Take, for instance, the statement given below:

confusionMatrix(predicted, actual)

Now, you should proceed with turning your numeric predictions into a vector of class predictions, say p_class. Suppose you want to use a cutoff of 0.50.

Also, while making predictions, don’t forget to name the positive and negative classes with separate indicators. Let’s call the positive classes “T” and name the negative ones as “L”. This is done to match classes with the original data.

Now that you have a p_class and actual values in the test dataset, you can start making your confusion matrix, calling the confusionMatrix() function.
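Putting these steps together might look like the sketch below. The probabilities and actual classes are made up for illustration. The call to confusionMatrix() only runs if the caret package is installed (some caret versions also need e1071), and the sketch otherwise falls back to a plain table().

```r
# Hypothetical model probabilities and true classes ("T" positive, "L" negative)
probability <- c(0.92, 0.35, 0.71, 0.18, 0.64, 0.09, 0.55, 0.44)
actual <- factor(c("T", "L", "T", "L", "L", "L", "T", "T"), levels = c("T", "L"))

# Cut at 0.50 and use the same factor levels as the actual values
p_class <- factor(ifelse(probability > 0.50, "T", "L"), levels = levels(actual))

if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("e1071", quietly = TRUE)) {
  cm <- caret::confusionMatrix(data = p_class, reference = actual, positive = "T")
  print(cm$table)                # the cross-tabulation
  print(cm$overall["Accuracy"])  # overall accuracy
} else {
  # Fallback without caret: just the contingency table
  print(table(Prediction = p_class, Reference = actual))
}
```

Naming the positive class explicitly avoids relying on the order of the factor levels.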

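Alternatively, you may want to trade one type of error for another by changing the threshold. A low threshold such as 0.10 labels more cases as positive, so it catches more true positives at the cost of more false positives; a high threshold such as 0.90 does the opposite. Thereafter, you can continue with the same steps as you did in the earlier exercise.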
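To see how the choice of threshold shifts the errors, the sketch below compares cutoffs of 0.10 and 0.90 on the same made-up data. The helper confusion_at() is a hypothetical function written for this illustration, not part of any package.

```r
# Hypothetical probabilities and true classes ("T" positive, "L" negative)
probability <- c(0.95, 0.80, 0.60, 0.40, 0.20, 0.05)
actual <- factor(c("T", "T", "L", "T", "L", "L"), levels = c("T", "L"))

# Illustrative helper: confusion table for a given cutoff
confusion_at <- function(threshold) {
  pred <- factor(ifelse(probability > threshold, "T", "L"), levels = c("T", "L"))
  table(Predicted = pred, Actual = actual)
}

cm_low  <- confusion_at(0.10)  # liberal cutoff: more predicted positives
cm_high <- confusion_at(0.90)  # strict cutoff: fewer predicted positives
print(cm_low)
print(cm_high)
```

With the low cutoff the false negatives disappear but false positives rise, and with the high cutoff the reverse happens.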

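With your new threshold, you can repeat this call to get the predicted classes:

pred <- ifelse(probability > threshold, "T", "L")

Finally, you can pass them to the confusionMatrix() function in caret, converting the predictions to a factor with the same levels as the actual values:

confusionMatrix(factor(pred, levels = levels(actual)), actual)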

Let me share the official documentation on the confusionMatrix() function in R.

confusionMatrix: Create a confusion matrix in R

Description

Calculates a cross-tabulation of observed and predicted classes with associated statistics.

Usage

## S3 method for class 'default':
confusionMatrix(data, reference, positive = NULL, dnn = c("Prediction", "Reference"), ...)

Arguments

  • data: a factor of predicted classes
  • reference: a factor of classes to be used as the true results
  • positive: an optional character string for the factor level that corresponds to a “positive” result (if that makes sense for your data). If there are only two factor levels, the first level will be used as the “positive” result.
  • dnn: a character vector of dimnames for the table
  • …: options to be passed to table. NOTE: do not include dnn here

Value

A list with elements:

  • table: the results of table on data and reference
  • positive: the positive result level
  • overall: a numeric vector with overall accuracy and Kappa statistic values
  • byClass: the sensitivity, specificity, positive predictive value and negative predictive value for each class. For two class systems, this is calculated once using the positive argument

Details

The function requires that the factors have exactly the same levels.

For two class problems, the sensitivity, specificity, positive predictive value and negative predictive value are calculated using the positive argument. For more than two classes, these results are calculated comparing each factor level to the remaining levels (i.e. a “one versus all” approach). In each case, the overall accuracy and Kappa statistic are calculated.
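The “one versus all” idea can be sketched in base R without caret. The code below is an illustration under assumptions, not caret's actual implementation: the three classes and both label vectors are invented, and the table follows the predictions-in-rows, reference-in-columns layout.

```r
classes   <- c("a", "b", "c")
reference <- factor(c("a", "a", "b", "b", "c", "c", "a", "b", "c"), levels = classes)
predicted <- factor(c("a", "b", "b", "b", "c", "a", "a", "c", "c"), levels = classes)

cm <- table(Prediction = predicted, Reference = reference)

# Treat each class in turn as "positive" and all others as "negative"
one_vs_all <- t(sapply(classes, function(cls) {
  tp <- cm[cls, cls]            # predicted cls, truly cls
  fp <- sum(cm[cls, ]) - tp     # predicted cls, truly something else
  fn <- sum(cm[, cls]) - tp     # truly cls, predicted something else
  tn <- sum(cm) - tp - fp - fn  # everything not involving cls
  c(Sensitivity = tp / (tp + fn), Specificity = tn / (tn + fp))
}))
print(one_vs_all)
```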

The overall accuracy rate is computed along with a 95 percent confidence interval for this rate (using binom.test) and a one-sided test to see if the accuracy is better than the “no information rate,” which is taken to be the largest class percentage in the data.
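The accuracy interval and the test against the “no information rate” described above can be approximated in base R with binom.test(). This is a sketch of the idea on made-up labels rather than a copy of caret's internals; the variable names are chosen for illustration.

```r
# Hypothetical reference labels and predictions
reference <- factor(c(rep("a", 6), rep("b", 4)), levels = c("a", "b"))
predicted <- factor(c("a", "a", "a", "a", "a", "b", "b", "b", "b", "a"),
                    levels = c("a", "b"))

n_correct <- sum(predicted == reference)
n_total   <- length(reference)
accuracy  <- n_correct / n_total

# No information rate: the share of the largest class in the reference data
no_info_rate <- max(table(reference)) / n_total

# 95 percent confidence interval for the accuracy
ci <- binom.test(n_correct, n_total)$conf.int

# One-sided test: is the accuracy better than the no information rate?
p_value <- binom.test(n_correct, n_total, p = no_info_rate,
                      alternative = "greater")$p.value

print(accuracy)
print(no_info_rate)
print(ci)
print(p_value)
```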

Examples

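library(caret)  # provides confusionMatrix()

# Four classes: random predictions compared with random reference labels
numLlvs <- 4
confusionMatrix(
   factor(sample(rep(letters[1:numLlvs], 200), 50)),
   factor(sample(rep(letters[1:numLlvs], 200), 50)))

# Two classes: the first factor level is used as the "positive" result
numLlvs <- 2
confusionMatrix(
   factor(sample(rep(letters[1:numLlvs], 200), 50)),
   factor(sample(rep(letters[1:numLlvs], 200), 50)))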

Why do you need a confusion matrix?

The following reasons show the benefits of a confusion matrix and how it helps with evaluating performance.

  1. The confusion matrix addresses the main weakness of classification accuracy. A single accuracy ratio can conceal important details about how the model performs, for example when one class is much more common than the other.
  2. The confusion matrix gives insight into the predictions and the types of errors made by the classification model. Correct and incorrect predictions are presented in summarized form.
  3. The errors and their types are classified to give you an understanding of the performance of your model.

How to calculate the confusion matrix in R?

The confusion matrix in R can be calculated by using the “confusionMatrix()” function of the caret library. This function not only calculates the matrix but also returns a detailed report of the matrix. You must follow some steps in order to calculate your confusion matrix.

  1. Take a test dataset with known outcomes.
  2. Make a prediction for each of its rows.
  3. Count the correct and incorrect predictions for every class.
    Once you have done this, you will get the counts organized in a table. The rows correspond to the predicted class while the columns correspond to the actual class. The correct predictions lie on the diagonal. Add the diagonal values and divide by the total number of predictions to get the accuracy of your model.

How to measure the performance in a confusion matrix?

You can calculate the accuracy rate of a model by using a 2×2 confusion matrix. The following formula will get you the success rate or the accuracy rate:
Accuracy = (TP+TN)/(TP+TN+FP+FN)
where TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative


The error rate of your model can be calculated in a similar way:
Error rate = (FP+FN)/(TP+TN+FP+FN) = 1-Accuracy
The concept of the error rate is very simple. Suppose your model has an accuracy rate of 80% then the error rate of your model will be 20%.
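As a small worked example, the following R sketch computes both rates from the four cell counts of a 2×2 confusion matrix. The counts are invented and chosen to give the 80% and 20% figures mentioned above.

```r
# Hypothetical cell counts of a 2x2 confusion matrix
tp <- 45  # true positives
tn <- 35  # true negatives
fp <- 12  # false positives
fn <- 8   # false negatives

total      <- tp + tn + fp + fn
accuracy   <- (tp + tn) / total  # share of correct predictions
error_rate <- (fp + fn) / total  # share of incorrect predictions

print(accuracy)
print(error_rate)
```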

Hope you learned something from this post.


About ᴾᴿᴼᵍʳᵃᵐᵐᵉʳ

Linux and Python enthusiast, in love with open source since 2014, Writer at programming-articles.com, India.
