Naive Bayes with Caret package


This post reimplements, with the caret package, what the previous post implemented with the e1071 package.
The caret package bundles many more features and appears to be the more powerful of the two.

Preliminary information

The same data set is used.

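  • Additional packages used:
  • caret package
  • tm package
  • pander package: renders output as nicely formatted tables.
  • doMC package: enables multicore parallel processing. Because it is implemented with fork() internally, it works only on Unix-like systems such as Linux, not on Windows.
  • A few helper functions are defined.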
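Since doMC depends on fork(), one way to keep a script portable is to register the backend only when the platform supports it. The following is a minimal sketch, not part of the original post: the variable name `can_fork` and the choice of 2 cores are illustrative assumptions, and the code falls back to sequential execution when doMC is not installed.

```r
# Register a parallel backend only where fork() is available.
# Assumption: doMC may or may not be installed, so that is checked first.
can_fork <- .Platform$OS.type == "unix"

if (can_fork && requireNamespace("doMC", quietly = TRUE)) {
  doMC::registerDoMC(cores = 2)  # 2 cores is an arbitrary illustrative choice
  message("doMC backend registered")
} else {
  message("doMC not used; caret will run sequentially")
}
```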

Import Packages

# libraries needed by caret
library(klaR)
library(MASS)
# for the Naive Bayes modelling
library(caret)
# to process the text into a corpus
library(tm)
# to get nice looking tables
library(pander)
# to simplify selections
library(dplyr)

# The doMC package uses fork() internally, so it does not work on Windows.
#library(doMC)
#registerDoMC(cores=4)

# a utility function for % freq tables
frqtab <- function(x) {
    round(100*prop.table(table(x)), 1)
}

# utility function to summarize model comparison results
sumpred <- function(cm) {
    summ <- list(TN=cm$table[1,1],  # true negatives
                 TP=cm$table[2,2],  # true positives
                 FN=cm$table[1,2],  # false negatives
                 FP=cm$table[2,1],  # false positives
                 acc=cm$overall["Accuracy"],  # accuracy
                 sens=cm$byClass["Sensitivity"],  # sensitivity
                 spec=cm$byClass["Specificity"])  # specificity
    lapply(summ, FUN=round, 2)
}
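The cell indices used in sumpred() rely on the table having predictions in rows and reference labels in columns, with ham first and spam second. The sketch below illustrates that layout on a hand-built table; the counts are hypothetical and the table is only a stand-in for the one inside a confusionMatrix object.

```r
# Hypothetical stand-in for the table inside a confusion matrix:
# rows are predictions, columns are the reference (true) labels.
# matrix() fills column by column.
tab <- matrix(c(90, 2,    # reference ham:  predicted ham, predicted spam
                5, 20),   # reference spam: predicted ham, predicted spam
              nrow = 2,
              dimnames = list(Prediction = c("ham", "spam"),
                              Reference  = c("ham", "spam")))

# With spam as the positive class, the four cells read as follows.
cells <- c(TN = tab["ham", "ham"],    # same cell as tab[1, 1]
           TP = tab["spam", "spam"],  # same cell as tab[2, 2]
           FN = tab["ham", "spam"],   # same cell as tab[1, 2]
           FP = tab["spam", "ham"])   # same cell as tab[2, 1]
cells
```

Indexing by name rather than by position makes the convention explicit and would not silently break if the level order changed.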

Reading and Preparing the data

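Downloading the data from within the script does not work reliably.

Instead, download the csv file uploaded with the previous e1071 post;
creating the R objects from the data then proceeds exactly as before.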

# read the sms data into the sms data frame
sms_raw <- read.csv("sms_spam.csv", stringsAsFactors = FALSE)

sms_raw[1072,"text"] <- c("All done, all handed in. Don't know if mega shop in asda counts as celebration but thats what i'm doing!")
sms_raw[1072,"type"] <- c("ham")

# examine the structure of the sms data
str(sms_raw)

# convert spam/ham to factor.
sms_raw$type <- factor(sms_raw$type)

# examine the type variable more carefully
str(sms_raw$type)
table(sms_raw$type)

# build a corpus using the text mining (tm) package
library(tm)

# try to encode again to UTF-8
sms_raw$text <- iconv(enc2utf8(sms_raw$text),sub="byte")

sms_corpus <- Corpus(VectorSource(sms_raw$text))

# examine the sms corpus
print(sms_corpus)
inspect(sms_corpus[1:3])

Output

<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 3

[[1]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 49

[[2]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 23

[[3]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 43

Preparing the data

As before, the data is cleaned up and transformed.
This time the work is written with dplyr syntax.
Running the command below opens the dplyr vignettes, which explain its usage.

browseVignettes(package = "dplyr")
# ------------------------------------
# ------------------------------------
# We will proceed in similar fashion as described in the book, but make use of "dplyr"
# syntax to execute the text cleanup / transformation operations
sms_corpus <- Corpus(VectorSource(sms_raw$text))
sms_corpus_clean <- tm_map(sms_corpus, content_transformer(tolower))
sms_corpus_clean <- tm_map(sms_corpus_clean, removeNumbers)
sms_corpus_clean <- tm_map(sms_corpus_clean, removeWords, stopwords())
sms_corpus_clean <- tm_map(sms_corpus_clean, removePunctuation)
sms_corpus_clean <- tm_map(sms_corpus_clean, stripWhitespace)

sms_dtm <- DocumentTermMatrix(sms_corpus_clean)

Creating a Classification model with Naive Bayes

1. Generating the training and testing datasets

The data is split so that the proportion of spam
is the same in the training set and the test set.
Unlike the previous post, the split is done with createDataPartition().
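To illustrate what a class-preserving (stratified) split means, here is a base R sketch on toy labels. It is not caret's implementation, only an approximation of the idea: sample the same fraction of rows within each class. The label vector and the seed are hypothetical.

```r
# Base R sketch of a stratified 75/25 split (not caret's actual algorithm).
set.seed(1)
type <- factor(c(rep("ham", 80), rep("spam", 20)))  # toy labels, hypothetical

# Sample 75% of the row indices separately within each class.
idx_by_class <- split(seq_along(type), type)
train_idx <- sort(unlist(lapply(idx_by_class, function(i) {
  i[sample.int(length(i), size = round(0.75 * length(i)))]
}), use.names = FALSE))

# Class proportions are the same in both parts of the split.
prop_train <- prop.table(table(type[train_idx]))
prop_test  <- prop.table(table(type[-train_idx]))
```

Because the sampling happens per class, both `prop_train` and `prop_test` keep the 80/20 ratio of the toy data, which is the property checked with frqtab() in this post.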

# -----------------------------------
# Creating a classification model with Naive Bayes
# Generating the training and testing datasets
train_index <- createDataPartition(sms_raw$type, p=0.75, list=FALSE)
sms_raw_train <- sms_raw[train_index,]
sms_raw_test <- sms_raw[-train_index,]
sms_corpus_clean_train <- sms_corpus_clean[train_index]
sms_corpus_clean_test <- sms_corpus_clean[-train_index]
sms_dtm_train <- sms_dtm[train_index,]
sms_dtm_test <- sms_dtm[-train_index,]

# check proportions in the testing and training sets 
ft_orig <- frqtab(sms_raw$type)
ft_train <- frqtab(sms_raw_train$type)
ft_test <- frqtab(sms_raw_test$type)
ft_df <- as.data.frame(cbind(ft_orig, ft_train, ft_test))
colnames(ft_df) <- c("Original", "Training set", "Test set")
pander(ft_df, style="rmarkdown",
       caption=paste0("Comparison of SMS type frequencies among datasets"))

|            |  Original  |  Training set  |  Test set  |
|:----------:|:----------:|:--------------:|:----------:|
|  **ham**   |    86.6    |      86.5      |    86.6    |
|  **spam**  |    13.4    |      13.5      |    13.4    |

Table: Comparison of SMS type frequencies among datasets

Only terms that occur with at least a minimum frequency are kept,
and the counts are converted to binary Absent/Present values.
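The conversion can be tried on a tiny count matrix first. The sketch below uses hypothetical counts and a helper named `to_presence`, which is an illustrative name rather than something from the original code.

```r
# Toy document-term counts (hypothetical): 3 messages, 2 terms.
# matrix() fills column by column, so each line below is one term.
counts <- matrix(c(0, 2, 1,    # term "free"
                   3, 0, 0),   # term "call"
                 nrow = 3,
                 dimnames = list(Docs = c("1", "2", "3"),
                                 Terms = c("free", "call")))

# Any positive count becomes "Present", zero becomes "Absent".
to_presence <- function(x) {
  factor(ifelse(x > 0, 1, 0), levels = c(0, 1), labels = c("Absent", "Present"))
}

# apply() over columns coerces the factor results to a character matrix.
presence <- apply(counts, MARGIN = 2, FUN = to_presence)
presence
```

Note that apply() returns a character matrix of the labels, not a matrix of factors.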

# we will pick terms that appear at least 5 times in the training document term matrix.
sms_dict <- findFreqTerms(sms_dtm_train, lowfreq=5)
sms_train <- DocumentTermMatrix(sms_corpus_clean_train, list(dictionary=sms_dict))
sms_test <- DocumentTermMatrix(sms_corpus_clean_test, list(dictionary=sms_dict))
# modified slightly from the code in the book
convert_counts <- function(x) {
    x <- ifelse(x > 0, 1, 0)
    # return the factor explicitly instead of relying on the value of an assignment
    factor(x, levels = c(0, 1), labels = c("Absent", "Present"))
}
sms_train <- sms_train %>% apply(MARGIN=2, FUN=convert_counts)
sms_test <- sms_test %>% apply(MARGIN=2, FUN=convert_counts)

2. Training the two prediction models

The model is trained with Naive Bayes,
using 10-fold cross validation.

Two models are trained separately.
They differ in their Laplace estimator and kernel density settings.
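For readers unfamiliar with 10-fold cross validation, the following base R sketch shows how rows can be assigned to folds. It is a simplification: caret creates its own folds inside train(), and this sketch ignores any class balancing. The value n = 4170 mirrors the number of training samples reported by train() in this post; the seed is arbitrary.

```r
# Base R sketch of 10-fold assignment (simplified; not caret's own fold creation).
set.seed(1)
n <- 4170   # number of training rows, as reported in the train() output
k <- 10

# Give every row a fold label 1..k, then shuffle the labels.
fold_id <- sample(rep_len(seq_len(k), n))

# For fold 1: hold out its rows and train on the remaining nine folds.
holdout  <- which(fold_id == 1)
training <- which(fold_id != 1)

c(holdout = length(holdout), training = length(training))
```

Each row is held out exactly once across the ten folds, and the reported accuracy is the average over those ten held-out evaluations.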

First, the tuning parameters that the caret package supports for the Naive Bayes model are as follows.

> modelLookup("nb")
  model parameter              label forReg forClass probModel
1    nb        fL Laplace Correction  FALSE     TRUE      TRUE
2    nb usekernel  Distribution Type  FALSE     TRUE      TRUE

As shown above, modelLookup() lists all tuning parameters supported for a given machine learning model.

The model below does not apply the Laplace correction, which is the default.

ctrl <- trainControl(method="cv", number=10)
set.seed(12358)
sms_model1 <- train(sms_train, sms_raw_train$type, method="nb", trControl=ctrl)
sms_model1

Naive Bayes 

4170 samples
1234 predictors
   2 classes: 'ham', 'spam' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 3753, 3753, 3754, 3753, 3753, 3753, ... 
Resampling results across tuning parameters:

  usekernel  Accuracy   Kappa      Accuracy SD  Kappa SD  
  FALSE      0.9798567  0.9098827  0.003606845  0.01680031
   TRUE      0.9798567  0.9098827  0.003606845  0.01680031

Tuning parameter 'fL' was held constant at a value of 0
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were fL = 0 and usekernel = FALSE.

The model below sets the Laplace correction to 1 and does not apply kernel density.

set.seed(12358)
sms_model2 <- train(sms_train, sms_raw_train$type, method="nb", 
                    tuneGrid=data.frame(.fL=1, .usekernel=FALSE),
                    trControl=ctrl)

> sms_model2
Naive Bayes 

4170 samples
1234 predictors
   2 classes: 'ham', 'spam' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 3753, 3753, 3754, 3753, 3753, 3753, ... 
Resampling results

  Accuracy   Kappa      Accuracy SD  Kappa SD  
  0.9793765  0.9071502  0.004822776  0.02320774

Tuning parameter 'fL' was held constant at a value of 1
Tuning parameter 'usekernel' was held constant at a value of FALSE

Testing the predictions

Both models are evaluated with confusionMatrix().
The positive class is spam, so a positive result means a message was predicted to be spam.

To use confusionMatrix(), pass the predicted labels and the reference labels to compare, and set the positive label.

# Testing the predictions
sms_predict1 <- predict(sms_model1, sms_test)
cm1 <- confusionMatrix(sms_predict1, sms_raw_test$type, positive="spam")
cm1
Confusion Matrix and Statistics

          Reference
Prediction  ham spam
      ham  1199   25
      spam    4  161

               Accuracy : 0.9791         
                 95% CI : (0.9702, 0.986)
    No Information Rate : 0.8661         
    P-Value [Acc > NIR] : < 2.2e-16      

                  Kappa : 0.9055         
 Mcnemar's Test P-Value : 0.0002041      

            Sensitivity : 0.8656         
            Specificity : 0.9967         
         Pos Pred Value : 0.9758         
         Neg Pred Value : 0.9796         
             Prevalence : 0.1339         
         Detection Rate : 0.1159         
   Detection Prevalence : 0.1188         
      Balanced Accuracy : 0.9311         

       'Positive' Class : spam 
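The statistics reported above can be recomputed by hand from the four cells of the table, which makes clear what each one measures. A small sketch, with the counts copied from the confusion matrix for the first model:

```r
# Cell counts copied from the confusion matrix above (positive class = spam).
tp <- 161   # predicted spam, actually spam
fn <- 25    # predicted ham,  actually spam
fp <- 4     # predicted spam, actually ham
tn <- 1199  # predicted ham,  actually ham

metrics <- round(c(
  sensitivity = tp / (tp + fn),                    # share of spam that is caught
  specificity = tn / (tn + fp),                    # share of ham that is left alone
  accuracy    = (tp + tn) / (tp + tn + fp + fn),   # share of all messages classified correctly
  ppv         = tp / (tp + fp)                     # share of spam predictions that are right
), 4)
metrics
```

The values come out as 0.8656, 0.9967, 0.9791 and 0.9758, matching the Sensitivity, Specificity, Accuracy and Pos Pred Value lines of the output.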

Next, the model with the Laplace estimator set and no kernel density:

> sms_predict2 <- predict(sms_model2, sms_test)
There were 50 or more warnings (use warnings() to see the first 50)
> cm2 <- confusionMatrix(sms_predict2, sms_raw_test$type, positive="spam")
> cm2
Confusion Matrix and Statistics

          Reference
Prediction  ham spam
      ham  1200   27
      spam    3  159

               Accuracy : 0.9784          
                 95% CI : (0.9693, 0.9854)
    No Information Rate : 0.8661          
    P-Value [Acc > NIR] : < 2.2e-16       

                  Kappa : 0.9015          
 Mcnemar's Test P-Value : 2.679e-05       

            Sensitivity : 0.8548          
            Specificity : 0.9975          
         Pos Pred Value : 0.9815          
         Neg Pred Value : 0.9780          
             Prevalence : 0.1339          
         Detection Rate : 0.1145          
   Detection Prevalence : 0.1166          
      Balanced Accuracy : 0.9262          

       'Positive' Class : spam            

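Without the Laplace estimator, zero probabilities can occur, which can make the model very inaccurate when the data set is small. If any single conditional probability is 0, the whole product is 0 regardless of the other terms.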
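The zero-probability problem is easy to reproduce with a few numbers. The sketch below uses hypothetical word counts and a common formulation of the Laplace correction for a binary feature (add fL to the count and 2 * fL to the class total); it is meant as an illustration of the idea and not as a copy of the package's internals.

```r
# Toy counts (hypothetical): number of training messages per class containing each word.
word_counts <- rbind(spam = c(free = 30, meeting = 0),
                     ham  = c(free = 5,  meeting = 40))
n_class <- c(spam = 100, ham = 400)   # messages per class

# P(word present | class), with Laplace correction fL.
# Each feature has 2 outcomes (Absent/Present), hence 2 * fL in the denominator.
cond_prob <- function(fL) (word_counts + fL) / (n_class + 2 * fL)

# Likelihood of a message containing both words, per class (naive independence).
lik_raw     <- apply(cond_prob(0), 1, prod)
lik_laplace <- apply(cond_prob(1), 1, prod)

rbind(lik_raw, lik_laplace)
```

With fL = 0 the spam likelihood is exactly 0, because "meeting" never occurred in a spam message, so the strong evidence from "free" is discarded. With fL = 1 both classes keep a nonzero likelihood and the other word can still influence the decision.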

Comparison of caret-based and e1071-based Naive Bayes

TP, TN, FP, FN
accuracy
sensitivity (also known as recall or true positive rate)
specificity (also known as true negative rate)

> # from the table on page 115 of the book
> tn=1203
> tp=151
> fn=32
> fp=4
> book_example1 <- list(
+     TN=tn,
+     TP=tp,
+     FN=fn,
+     FP=fp,
+     acc=(tp + tn)/(tp + tn + fp + fn),
+     sens=tp/(tp + fn),
+     spec=tn/(tn + fp))
> 
> # from the table on page 116 of the book
> tn=1204
> tp=152
> fn=31
> fp=3
> book_example2 <- list(
+     TN=tn,
+     TP=tp,
+     FN=fn,
+     FP=fp,
+     acc=(tp + tn)/(tp + tn + fp + fn),
+     sens=tp/(tp + fn),
+     spec=tn/(tn + fp))
> 
> b1 <- lapply(book_example1, FUN=round, 2)
> b2 <- lapply(book_example2, FUN=round, 2)
> m1 <- sumpred(cm1)
> m2 <- sumpred(cm2)
> model_comp <- as.data.frame(rbind(b1, b2, m1, m2))
> rownames(model_comp) <- c("Book model 1", "Book model 2", "Caret model 1", "Caret model 2")
> pander(model_comp, style="rmarkdown", split.tables=Inf, keep.trailing.zeros=TRUE,
+        caption="Model results when comparing predictions and test set")


|                     |  TN  |  TP  |  FN  |  FP  |  acc  |  sens  |  spec  |
|:-------------------:|:----:|:----:|:----:|:----:|:-----:|:------:|:------:|
|  **Book model 1**   | 1203 | 151  |  32  |  4   | 0.97  |  0.83  |   1    |
|  **Book model 2**   | 1204 | 152  |  31  |  3   | 0.98  |  0.83  |   1    |
|  **Caret model 1**  | 1199 | 161  |  25  |  4   | 0.98  |  0.87  |   1    |
|  **Caret model 2**  | 1200 | 159  |  27  |  3   | 0.98  |  0.85  |   1    |

Table: Model results when comparing predictions and test set

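The comparison shows no notable difference in accuracy between the e1071 and caret packages.
Sensitivity, however, does differ: 0.83 for both book models versus 0.87 and 0.85 for the caret models.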
