Cross Validation


Ch.10 Evaluating Model Performance
This chapter covers k-fold cross-validation.

K is usually set to 10. After the model has been trained and evaluated 10 times (with 10 different training/testing combinations), the average performance across all folds is reported.

If k is pushed to its extreme so that it equals the number of observations, the procedure becomes the leave-one-out method (LOOCV). This method is used when the data available for training is very scarce.
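As a sketch, leave-one-out cross-validation does not have to be coded by hand: caret exposes it directly through `trainControl` (this uses the built-in iris data and `method = "lda"` purely for illustration; they are not from the example in this post).

```r
library(caret)

# LOOCV: every observation is held out exactly once,
# so the number of resamples equals the number of rows.
ctrl_loo <- trainControl(method = "LOOCV")

# Illustrative fit on built-in data; substitute your own predictors/outcome.
model_loo <- train(iris[, 1:4], iris$Species,
                   method = "lda", trControl = ctrl_loo)
model_loo
```

Because every row is scored once, LOOCV is expensive on large data sets, which is why k = 10 is the usual compromise.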

createFolds

The k-fold split can be generated with the following function:

createFolds(y, k = 10, list = TRUE, returnTrain = FALSE)

To use createFolds, the input y must be a vector type,
and it must be in chronological order.
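For instance, on the built-in iris data (an illustration, not the data used below), createFolds returns a named list of row indices, one element per fold:

```r
library(caret)

set.seed(1)
# list = TRUE (the default) returns a list of held-out row indices;
# returnTrain = TRUE would return the training indices instead.
folds <- createFolds(iris$Species, k = 10)

str(folds[1:2])
# Each element (Fold01, Fold02, ...) holds roughly 150/10 = 15 row numbers,
# sampled so that the class distribution of y is preserved in every fold.
```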

# 10-fold cross validation with a manual implementation
library(caret)

set.seed(12358)
# returnTrain = TRUE: each fold holds the TRAINING row indices
folds <- createFolds(dfList_jemin[[2]]$app_name, k = 10, returnTrain = TRUE)

cv_results <- lapply(folds, function(x) {
    training <- data.frame(df[x, ])
    testing  <- data.frame(df[-x, ])

    classTraining <- class[x]
    classTesting  <- class[-x]

    # Train a Naive Bayes model on this fold's training split
    sms_model1 <- train(training, classTraining, method = "nb")

    # Predict on the held-out split and compute the confusion matrix
    credit_pred <- predict(sms_model1, testing)
    cm1 <- confusionMatrix(credit_pred, classTesting, positive = "TRUE")

    # Return this fold's accuracy
    return(cm1$overall[[1]])
})

str(cv_results)
mean(unlist(cv_results))

Execution result:

> str(cv_results)
List of 10
 $ Fold01: num 0.886
 $ Fold02: num 0.793
 $ Fold03: num 0.788
 $ Fold04: num 0.676
 $ Fold05: num 0.5
 $ Fold06: num 0.719
 $ Fold07: num 0.688
 $ Fold08: num 0.719
 $ Fold09: num 0.788
 $ Fold10: num 0.788
> mean(unlist(cv_results))
[1] 0.7343925

Using a caret control option: trControl=ctrl

The same cross-validation can also be performed with the caret package's control function.

# 10-fold cross validation with caret
# method = "cv", number = 10; verboseIter prints progress for each fold
ctrl <- trainControl(method = "cv", number = 10, verboseIter = TRUE)

set.seed(12358)
sms_model1 <- train(df, class, method = "nb", trControl = ctrl)
sms_model1

Execution result:

Naive Bayes 

325 samples
  1 predictor
  2 classes: 'FALSE', 'TRUE' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 292, 293, 293, 292, 292, 292, ... 
Resampling results across tuning parameters:

  usekernel  Accuracy   Kappa      Accuracy SD  Kappa SD 
  FALSE      0.7321481  0.4587484  0.07483437   0.1433968
   TRUE      0.7321481  0.4587484  0.07483437   0.1433968

Tuning parameter 'fL' was held constant at a value of 0
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were fL = 0 and usekernel = FALSE. 
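The per-fold accuracies behind the summary above can also be pulled out of the fitted object itself (a sketch assuming the sms_model1 object created above; `resample` is the slot where caret stores one row per fold):

```r
# One row of Accuracy/Kappa per resample (here, per fold)
sms_model1$resample

# Averaging the per-fold accuracies reproduces the summary figure
mean(sms_model1$resample$Accuracy)
```

This is handy for checking how much the performance varies across folds, not just its mean.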

