Cross Validation
Ch.10 Evaluating Model Performance
This post covers k-fold cross-validation.
k is conventionally set to 10.
After the process of training and evaluating the model has occurred 10 times (with 10 different training/testing combinations), the average performance across all the folds is reported.
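When k is pushed to its maximum, k = n (the number of samples), the procedure becomes the leave-one-out method. This method is used when there is too little data available for training.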
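To make the relationship between k-fold and leave-one-out concrete, here is a minimal base-R sketch of how rows can be dealt into folds. The helper `make_folds` and the values of `n` and `k` are hypothetical, chosen only for illustration; they are not part of caret or of this post's data.

```r
# Minimal base-R sketch of k-fold index assignment (no packages needed).
# make_folds, n and k are illustrative names and values only.
make_folds <- function(n, k) {
  # Shuffle fold labels so rows are dealt into k groups of near-equal size
  fold_id <- sample(rep(seq_len(k), length.out = n))
  split(seq_len(n), fold_id)
}

set.seed(1)
folds_10  <- make_folds(n = 25, k = 10)  # ordinary 10-fold split
folds_loo <- make_folds(n = 25, k = 25)  # k = n: leave-one-out

lengths(folds_10)   # each held-out fold has 2 or 3 rows
lengths(folds_loo)  # each held-out fold has exactly 1 row
```

With k = n every model is trained on n - 1 rows and tested on the single remaining row, which is why leave-one-out is attractive when data is scarce but costly when n is large.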
createFolds
The following caret function can be used to build the k folds.
createFolds(y, k = 10, list = TRUE, returnTrain = FALSE)
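To use createFolds, the input y must be a vector (the vector of outcomes). The caret documentation also mentions chronological order for y, but that requirement applies to createTimeSlices; createFolds samples rows at random within each class, so the order of y does not matter.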
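As a quick illustration of what createFolds returns, the sketch below runs it on the built-in iris data. The data set is an assumption made for this example and is unrelated to the data used later in the post.

```r
library(caret)

y <- iris$Species  # built-in example outcome, used only for illustration

# Default (returnTrain = FALSE): each element holds the held-out row indices
set.seed(12358)
test_folds <- createFolds(y, k = 10, list = TRUE, returnTrain = FALSE)

# returnTrain = TRUE: each element holds the rows used for training instead
set.seed(12358)
train_folds <- createFolds(y, k = 10, list = TRUE, returnTrain = TRUE)

str(test_folds[1:2])

# The held-out rows of a training fold are the complement of its indices
held_out <- setdiff(seq_along(y), train_folds[[1]])
table(y[held_out])  # class counts in the first held-out fold
```

Which form to use is a matter of convenience: with returnTrain = TRUE the indices select the training rows directly and negative indexing gives the test rows.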
# 10-fold cross-validation, implemented by hand
library(caret)  # method = "nb" also needs the klaR package installed
set.seed(12358)
# returnTrain = TRUE: each fold holds the row indices of the training set.
# The vector passed here must have one element per row of df.
folds <- createFolds(dfList_jemin[[2]]$app_name, k = 10, returnTrain = TRUE)
cv_results <- lapply(folds, function(x) {
  training <- data.frame(df[x, ])
  testing <- data.frame(df[-x, ])
  classTraining <- class[x]
  classTesting <- class[-x]
  model <- train(training, classTraining, method = "nb")
  pred <- predict(model, testing)
  cm <- confusionMatrix(pred, classTesting, positive = "TRUE")
  return(cm$overall[["Accuracy"]])  # accuracy on the held-out fold
})
str(cv_results)
mean(unlist(cv_results))  # average accuracy across the 10 folds
Output
> str(cv_results)
List of 10
$ Fold01: num 0.886
$ Fold02: num 0.793
$ Fold03: num 0.788
$ Fold04: num 0.676
$ Fold05: num 0.5
$ Fold06: num 0.719
$ Fold07: num 0.688
$ Fold08: num 0.719
$ Fold09: num 0.788
$ Fold10: num 0.788
> mean(unlist(cv_results))
[1] 0.7343925
Using the caret control option: trControl=ctrl
The same cross-validation can also be run through the control function of the caret package, trainControl.
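# 10-fold cross-validation with caret
library(caret)
# number = 10 folds; verboseIter = TRUE prints progress for each resample
ctrl <- trainControl(method = "cv", number = 10, verboseIter = TRUE)
set.seed(12358)
sms_model1 <- train(df, class, method = "nb", trControl = ctrl)
sms_model1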
Output
Naive Bayes
325 samples
1 predictor
2 classes: 'FALSE', 'TRUE'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 292, 293, 293, 292, 292, 292, ...
Resampling results across tuning parameters:
  usekernel  Accuracy   Kappa      Accuracy SD  Kappa SD
  FALSE      0.7321481  0.4587484  0.07483437   0.1433968
  TRUE       0.7321481  0.4587484  0.07483437   0.1433968
Tuning parameter 'fL' was held constant at a value of 0
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were fL = 0 and usekernel = FALSE.
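The printed summary shows only the averaged accuracy, while the hand-written loop earlier listed one accuracy per fold. A fitted caret model keeps those per-fold values in its `resample` element. The sketch below shows this on stand-in inputs: iris and method = "knn" are assumptions chosen so the example runs without this post's data or the klaR package.

```r
library(caret)

ctrl <- trainControl(method = "cv", number = 10)
set.seed(12358)
# iris and method = "knn" are stand-ins chosen for this sketch
fit <- train(x = iris[, 1:4], y = iris$Species,
             method = "knn", trControl = ctrl)

# One row per fold (Accuracy, Kappa, Resample) for the selected tuning value
fit$resample

# Averaging the per-fold accuracies gives the cross-validated estimate
mean(fit$resample$Accuracy)
```

Looking at the per-fold rows is a simple way to spot a single weak fold, such as the 0.5 seen in the manual run above.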