Handling Class Imbalance with R


When you try to apply machine learning to real-world data, you quickly run into the class imbalance problem, where one class has far more instances than the other.

Methods to improve performance on imbalanced data

  • Class weights: impose a heavier cost when errors are made in the minority class (see the rpart sketch after this list)

  • Down-sampling: randomly remove instances in the majority class

  • Up-sampling: randomly replicate instances in the minority class

  • Synthetic minority over-sampling technique (SMOTE): down-samples the majority class and synthesizes new minority instances by interpolating between existing ones
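As an example of the class-weight idea, tree learners such as rpart accept a loss matrix that penalizes minority-class errors more heavily. A minimal sketch, assuming the hacide data introduced later in this post; the 10x penalty is an arbitrary illustration, not a tuned value.

library(ROSE)   # provides the hacide.train / hacide.test data used below
library(rpart)

data(hacide)

# Loss matrix: rows are true classes, columns are predicted classes
# (factor levels "0", "1"). Misclassifying a true "1" (minority) as "0"
# costs 10, while the reverse error costs 1.
weighted.tree <- rpart(cls ~ ., data = hacide.train,
                       parms = list(loss = matrix(c(0, 10, 1, 0), nrow = 2)))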

Conceptually, SMOTE creates a synthetic minority instance at a random position on the line segment between an existing minority instance and one of its nearest minority-class neighbors.
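A toy sketch of that interpolation step in plain R; x and neighbor are hypothetical feature vectors, not part of any real data set.

# Illustration only: one SMOTE-style interpolation between a minority
# instance and one of its nearest minority-class neighbors.
set.seed(1)
x        <- c(1.0, 2.0)   # a minority-class instance
neighbor <- c(2.0, 3.0)   # one of its k nearest minority-class neighbors

gap <- runif(1)                       # random position along the segment
synthetic <- x + gap * (neighbor - x) # new synthetic minority instance
synthetic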

Imbalanced Classification in R

One way to tackle the class imbalance problem in R is to use the ROSE or DMwR package.
The example here deals with a binary classification problem.

  • ROSE: the Random Over-Sampling Examples package. It generates artificial data based on sampling methods.
install.packages("ROSE")
library(ROSE)

Internally, the ROSE package ships with an imbalanced data set, split into hacide.train and hacide.test.

The SMOTE approach

Plain over-sampling generates too many duplicated values, while under-sampling throws away too much important data.
ROSE solves both problems by generating synthetic examples instead.
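The DMwR package mentioned above provides a SMOTE implementation directly. A minimal sketch on the same hacide data; note that DMwR has been archived on CRAN and may need to be installed from the archive, and the percentages below are illustrative rather than tuned.

library(DMwR)   # archived on CRAN; may require installation from the archive
library(ROSE)   # for the hacide data

data(hacide)

# perc.over = 600: create 6 synthetic cases per original minority case.
# perc.under = 200: sample 2 majority cases per synthetic case created.
hacide.smote <- SMOTE(cls ~ ., data = hacide.train,
                      perc.over = 600, perc.under = 200)
table(hacide.smote$cls)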

Implementation code

library(ROSE)

data(hacide) # imbalanced data
str(hacide.train)
str(hacide.test)


# check table
table(hacide.train$cls)
table(hacide.test$cls)

# check classes distribution
prop.table(table(hacide.train$cls))
prop.table(table(hacide.test$cls))


# only 2% of the data is positive (20 of the 1000 training rows).
# it is a severely imbalanced data set.

library(rpart)
treeimb <- rpart(cls ~ ., data = hacide.train)
pred.treeimb <- predict(treeimb, newdata = hacide.test) # probability matrix; column 2 = P(cls = 1)

# check model performance: accuracy.meas reports precision, recall and F at the default 0.5 threshold
accuracy.meas(hacide.test$cls, pred.treeimb[,2])
roc.curve(hacide.test$cls, pred.treeimb[,2], plotit = F)


# over sampling
data_balanced_over <- ovun.sample(cls~., data = hacide.train, method = "over", N=1960)$data
# N is the number of observations in the resulting balanced set:
# 980 majority + 980 over-sampled minority = 1960.

table(data_balanced_over$cls)

# under sampling
data_balanced_under <- ovun.sample(cls~., data = hacide.train, method = "under", N=40)$data
# N = 40: the 20 minority instances plus 20 sampled majority instances

# under and over sampling combined (both)
# the minority class is over-sampled with replacement and the majority class is under-sampled without replacement
# p = 0.5: desired probability of the minority class in the resampled set
data_balanced_both <- ovun.sample(cls ~ ., data = hacide.train, method = "both", p=0.5, N=1000, seed = 1)$data
table(data_balanced_both$cls)

# ROSE generates a synthetic, balanced sample via a smoothed bootstrap
# (drawing from kernel density estimates around the observed instances)
data.rose <- ROSE(cls ~ ., data = hacide.train, seed = 1)$data
table(data.rose$cls)


#build decision tree models
tree.rose <- rpart(cls ~ ., data = data.rose)
tree.over <- rpart(cls ~ ., data = data_balanced_over)
tree.under <- rpart(cls ~ ., data = data_balanced_under)
tree.both <- rpart(cls ~ ., data = data_balanced_both)

#make predictions on unseen data
pred.tree.rose <- predict(tree.rose, newdata = hacide.test)
pred.tree.over <- predict(tree.over, newdata = hacide.test)
pred.tree.under <- predict(tree.under, newdata = hacide.test)
pred.tree.both <- predict(tree.both, newdata = hacide.test)

#AUC ROSE
roc.curve(hacide.test$cls, pred.tree.rose[,2])

#AUC Oversampling
roc.curve(hacide.test$cls, pred.tree.over[,2])

#AUC Undersampling
roc.curve(hacide.test$cls, pred.tree.under[,2])

#AUC Both
roc.curve(hacide.test$cls, pred.tree.both[,2])

Results

> roc.curve(hacide.test$cls, pred.tree.rose[,2])
Area under the curve (AUC): 0.989
> roc.curve(hacide.test$cls, pred.tree.over[,2], add.roc=TRUE)
Area under the curve (AUC): 0.798
> roc.curve(hacide.test$cls, pred.tree.under[,2], add.roc=TRUE)
Area under the curve (AUC): 0.876
> roc.curve(hacide.test$cls, pred.tree.both[,2], add.roc=TRUE)
Area under the curve (AUC): 0.798

In conclusion, the ROSE package, which generates synthetic minority examples in the same spirit as SMOTE, yields the highest AUC on this data set (0.989).
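A single test-set AUC can be sensitive to the particular split. The ROSE package also provides ROSE.eval for holdout or bootstrap estimates of a chosen accuracy measure; a minimal sketch, following the package documentation, of a holdout AUC estimate for an rpart learner trained on ROSE-generated data:

library(ROSE)
library(rpart)

# Holdout estimate of the AUC; extr.pred extracts the predicted
# probability of the positive class from the rpart output matrix.
ROSE.holdout <- ROSE.eval(cls ~ ., data = hacide.train, learner = rpart,
                          method.assess = "holdout",
                          extr.pred = function(obj) obj[, 2],
                          seed = 1)
ROSE.holdout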


