Handling Class Imbalance with R
When you apply machine learning to real-world data, you often run into the class imbalance problem: one class vastly outnumbers the other.
Methods to improve performance on imbalanced data
Class weights: impose a heavier cost when errors are made in the minority class (see the sketch after this list)
Down-sampling: randomly remove instances from the majority class
Up-sampling: randomly replicate instances in the minority class
Synthetic minority over-sampling technique (SMOTE): down-samples the majority class and synthesizes new minority instances by interpolating between existing ones
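As a minimal sketch of the class-weights idea (assuming the hacide.train data from the ROSE package introduced below, and an illustrative 10:1 weight ratio that is not tuned), rpart accepts per-observation case weights:
library(rpart)
data(hacide, package = "ROSE")
# weight each minority-class case 10x as heavily as a majority case
w <- ifelse(hacide.train$cls == "1", 10, 1)
tree.weighted <- rpart(cls ~ ., data = hacide.train, weights = w)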
Pictorially, SMOTE creates new minority points along the line segments joining a minority instance to its nearest minority neighbors.
Imbalanced Classification in R
The class imbalance problem can be tackled with the ROSE and DMwR packages. The example here deals with a binary classification problem.
- ROSE: the Random Over-Sampling Examples package. It generates artificial data based on sampling methods.
install.packages("ROSE")
library(ROSE)
Internally, the ROSE package ships with an imbalanced data set split into hacide.train and hacide.test.
The SMOTE approach
Over-sampling generates too many duplicated values, while under-sampling discards too much important data. Synthetic-sampling methods such as SMOTE and ROSE address both problems.
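The ROSE package is used below; as a hedged sketch of SMOTE itself, the DMwR package provides a SMOTE() function (note that DMwR has since been archived on CRAN, and the perc.over/perc.under values here are illustrative, not tuned):
library(DMwR)    # archived on CRAN; may need installation from the archive
data(hacide, package = "ROSE")
# perc.over = 200: synthesize two new minority cases per existing minority case
# perc.under = 200: sample two majority cases per synthetic minority case
train.smote <- SMOTE(cls ~ ., data = hacide.train, perc.over = 200, perc.under = 200)
table(train.smote$cls)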
Implementation code
library(ROSE)
data(hacide) # imbalanced data
str(hacide.train)
str(hacide.test)
# check table
table(hacide.train$cls)
table(hacide.test$cls)
# check classes distribution
prop.table(table(hacide.train$cls))
prop.table(table(hacide.test$cls))
# only 2% of the data is positive: a severely imbalanced data set.
# (a trivial all-negative classifier would already be 98% accurate,
# so accuracy alone is a misleading metric here)
library(rpart)
treeimb <- rpart(cls ~ ., data = hacide.train)
pred.treeimb <- predict(treeimb, newdata = hacide.test)  # matrix of class probabilities
# check model accuracy: accuracy.meas reports precision, recall, and the F measure
accuracy.meas(hacide.test$cls, pred.treeimb[,2])
# AUC of the baseline model; [,2] is the predicted probability of the positive class
roc.curve(hacide.test$cls, pred.treeimb[,2], plotit = F)
# over sampling
data_balanced_over <- ovun.sample(cls~., data = hacide.train, method = "over", N=1960)$data
# N refers to number of observations in the resulting balanced set.
table(data_balanced_over$cls)
# under sampling
data_balanced_under <- ovun.sample(cls~., data = hacide.train, method = "under", N=40)$data
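table(data_balanced_under$cls)  # check the counts: 20 minority kept, 20 majority sampled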
# under- and over-sampling combined (both):
# the minority class is oversampled with replacement and the majority class is undersampled without replacement
data_balanced_both <- ovun.sample(cls ~ ., data = hacide.train, method = "both", p=0.5, N=1000, seed = 1)$data
table(data_balanced_both$cls)
# ROSE: generate a synthetic, balanced sample (smoothed-bootstrap based)
data.rose <- ROSE(cls ~ ., data = hacide.train, seed = 1)$data
table(data.rose$cls)
#build decision tree models
tree.rose <- rpart(cls ~ ., data = data.rose)
tree.over <- rpart(cls ~ ., data = data_balanced_over)
tree.under <- rpart(cls ~ ., data = data_balanced_under)
tree.both <- rpart(cls ~ ., data = data_balanced_both)
#make predictions on unseen data
pred.tree.rose <- predict(tree.rose, newdata = hacide.test)
pred.tree.over <- predict(tree.over, newdata = hacide.test)
pred.tree.under <- predict(tree.under, newdata = hacide.test)
pred.tree.both <- predict(tree.both, newdata = hacide.test)
#AUC ROSE
roc.curve(hacide.test$cls, pred.tree.rose[,2])
#AUC Oversampling
roc.curve(hacide.test$cls, pred.tree.over[,2])
#AUC Undersampling
roc.curve(hacide.test$cls, pred.tree.under[,2])
#AUC Both
roc.curve(hacide.test$cls, pred.tree.both[,2])
Output
> roc.curve(hacide.test$cls, pred.tree.rose[,2])
Area under the curve (AUC): 0.989
> roc.curve(hacide.test$cls, pred.tree.over[,2], add.roc=TRUE)
Area under the curve (AUC): 0.798
> roc.curve(hacide.test$cls, pred.tree.under[,2], add.roc=TRUE)
Area under the curve (AUC): 0.876
> roc.curve(hacide.test$cls, pred.tree.both[,2], add.roc=TRUE)
Area under the curve (AUC): 0.798
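A single train/test split can give an optimistic estimate. The ROSE package also provides ROSE.eval for holdout or bootstrap assessment; a minimal sketch (extr.pred extracts the positive-class probability column, and seed = 1 is an arbitrary choice):
ROSE.holdout <- ROSE.eval(cls ~ ., data = hacide.train, learner = rpart,
                          method.assess = "holdout",
                          extr.pred = function(obj) obj[, 2], seed = 1)
ROSE.holdout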
In conclusion, the ROSE-based approach, which generates synthetic examples in the same spirit as SMOTE, achieves the highest AUC (0.989).
References
[1] Wicked Good Data
[2] Silicon Valley Data Science blog post
[3] SMOTE Implementation in Python
[4] https://www.analyticsvidhya.com/blog/2016/03/practical-guide-deal-imbalanced-classification-problems/