Confusion matrix confusion: factor level order in binary classification

Here is a not-so-endearing quirk in R: you have a binary classification problem for which you build, say, a logistic regression model. To work with the data conveniently and interpret the coefficients straightforwardly, I recommend:

  1. factorize your target variable
  2. set the levels of the factor so that the level you want to predict is the second

E.g. if you want to predict fraud, the levels should be c("nofraud", "fraud"). Of course the strings themselves don't matter, as long as each one can also serve as a valid column name (so you can't have levels like "1" and "2", nor T/F or TRUE/FALSE). The reason is that some algorithms create dummy variables and try to name the dummy columns after the factor levels.

OK, so if you do 1. and 2., your levels get encoded as 0 and 1 respectively, which means that a positive coefficient implies higher odds for the class you want to predict, just as you would expect.
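A minimal sketch of steps 1. and 2. with base R's glm. The data is simulated and the names amount, is_fraud and target are made up for illustration; only the level order matters here.

```r
# Hypothetical toy data, simulated purely for illustration.
set.seed(42)
n <- 200
amount <- rnorm(n)
# In this simulation, larger amounts make fraud more likely.
is_fraud <- rbinom(n, 1, plogis(-1 + 2 * amount))

# Steps 1 and 2: factorize, with the class to predict as the second level.
target <- factor(ifelse(is_fraud == 1, "fraud", "nofraud"),
                 levels = c("nofraud", "fraud"))

levels(target)  # first level is treated as 0 (failure), second as 1 (success)

# For a binomial glm with a factor response, the first level denotes failure.
fit <- glm(target ~ amount, family = binomial)
coef(fit)["amount"]  # a positive value means higher odds of "fraud"
```

Had the levels been c("fraud", "nofraud") instead, the same fit would return the coefficient with the opposite sign, which is the kind of confusion the level order is meant to avoid.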

Now the gotcha: if you then calculate the confusion matrix with confusionMatrix from the caret package, by default it treats the first level as the "positive" class, not the second one. This means that all the metrics (sensitivity etc.) get calculated with respect to the wrong class.

To fix it, call confusionMatrix and explicitly pass the level you need as the positive class:

confusionMatrix(fit.predicted.class, data[, target], positive = "fraud")
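To see the difference on a small case, here is a sketch with hand-made labels. The vectors actual and predicted are hypothetical, and it assumes the caret package is installed (some caret versions also ask for the e1071 package when computing a confusion matrix).

```r
library(caret)  # assumption: caret is installed

# Hypothetical labels, made up for illustration: 3 frauds among 8 cases.
lv <- c("nofraud", "fraud")
actual    <- factor(c("fraud", "fraud", "fraud", "nofraud", "nofraud",
                      "nofraud", "nofraud", "nofraud"), levels = lv)
predicted <- factor(c("fraud", "fraud", "nofraud", "nofraud", "nofraud",
                      "nofraud", "nofraud", "fraud"), levels = lv)

# Default: the first level ("nofraud") is taken as the positive class.
cm_default <- confusionMatrix(predicted, actual)
# Explicit: metrics are computed with respect to "fraud".
cm_fraud   <- confusionMatrix(predicted, actual, positive = "fraud")

cm_default$positive                # "nofraud"
cm_fraud$positive                  # "fraud"
cm_default$byClass["Sensitivity"]  # 4/5, the recall for "nofraud"
cm_fraud$byClass["Sensitivity"]    # 2/3, the recall for "fraud"
```

The counts in the table are identical in both calls; only the class that sensitivity, specificity and the other per-class metrics refer to changes.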
