Naive Bayes vs Neural Network


This post evaluates which algorithm performs better on the same data: Naive Bayes or Deep Neural Networks.

The data describe the responsiveness of a single user.
The analysis uses 1,276 responsiveness records.

The class distribution of the responses is as follows.
(FALSE): 922
(TRUE): 354
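Because the classes are imbalanced, it helps to know how well a model does by always predicting the majority class. The short sketch below computes that baseline from the two counts above; the variable names are only illustrative.

```python
# Class counts reported above.
counts = {"FALSE": 922, "TRUE": 354}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = counts[majority_class] / total

print(total)                        # 1276
print(majority_class)               # FALSE
print(round(baseline_accuracy, 4))  # 0.7226
```

Any classifier evaluated on this data should be compared against roughly 72% accuracy rather than 50%.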

The data are split into 70% for training and 30% for testing.
Because the split is random, all days of the week and times of day are mixed appropriately.
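As a rough illustration of a random 70/30 split that keeps the class ratio the same in both parts, here is a NumPy sketch. The function `stratified_split` and the synthetic label array are assumptions made for this example; they are not the code or data used in this post, and the exact row counts can differ by a row or two from what caret produces.

```python
import numpy as np

def stratified_split(labels, train_fraction=0.7, seed=12358):
    """Return (train_idx, test_idx) with class proportions preserved in both parts."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_parts = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)   # row numbers belonging to this class
        rng.shuffle(idx)                      # random order within the class
        n_train = int(round(train_fraction * len(idx)))
        train_parts.append(idx[:n_train])
    train_idx = np.sort(np.concatenate(train_parts))
    test_idx = np.setdiff1d(np.arange(len(labels)), train_idx)
    return train_idx, test_idx

# Synthetic labels with the class counts reported above (not the real data).
labels = np.array([False] * 922 + [True] * 354)
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))
print(labels[train_idx].mean(), labels[test_idx].mean())  # TRUE ratio in each part
```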

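By request, part of the data is shared via Google Drive.
Please use it only as a reference for the format.
The first 8 columns are the input features and the last column, class, is the output.
Responsiveness takes the values True and False, so this is a binary classification problem.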

Naive Bayes Classifier in R

Data structure

> head(training)
1        1    74          18               4                  2         2        1                 1
2        2    10          18               4                  2         2        3                 2
4        1    56          19               4                  2         2        1                 1
6        1    84          19               4                  1         1        1                 4
8        1    39          19               4                  1         2        1                 4
9        1    56          19               4                  1         2        1                 4
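
Code

library(caret)
set.seed(12358)

# Random 70/30 split, sampled within each class of the response
inTrain <- createDataPartition(y=factorClassList[['ikhee']], p=0.70, list =FALSE)
training <- data.frame(dfList[['ikhee']][inTrain,])
testing <-  data.frame(dfList[['ikhee']][-inTrain,])

classTraining <- factorClassList[['ikhee']][inTrain]
classtesting <-  factorClassList[['ikhee']][-inTrain]

# ctrl is not defined in the original post; 10-fold cross-validation is an assumption
ctrl <- trainControl(method="cv", number=10)

# Tune the Laplace correction (fL) and Gaussian vs. kernel density estimation (usekernel).
# Newer caret releases also expect an adjust column in this grid.
sms_model1 <- train(training,classTraining, method="nb", trControl = ctrl, tuneGrid = 
                        data.frame(.fL=c(0,0,1,1,10,10), .usekernel=c(FALSE,TRUE,FALSE,TRUE,FALSE,TRUE)))
sms_model1

# Evaluate on the held-out 30%, treating TRUE as the positive class
sms_predict1 <- predict(sms_model1, testing)
cm1 <- confusionMatrix(sms_predict1, classtesting, positive="TRUE")
cm1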
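The tuning grid above varies `fL`, the Laplace correction. As far as I understand, it only matters for categorical predictors, where it adds pseudo-counts so that a level never seen with a class does not get probability zero. The sketch below shows the standard Laplace-smoothed estimate for one feature within one class; the helper `smoothed_likelihoods` and the toy values are hypothetical and are not the internals of the R package.

```python
from collections import Counter

def smoothed_likelihoods(values, levels, fL=0.0):
    """Estimate P(level | class) for one categorical feature within one class.

    fL is the Laplace correction: fL pseudo-counts are added to every level.
    """
    counts = Counter(values)
    denom = len(values) + fL * len(levels)
    return {lv: (counts[lv] + fL) / denom for lv in levels}

# Hypothetical feature values observed for one class (not taken from the real data).
observed = [1, 1, 1, 2]
levels = [1, 2, 3]

no_smoothing = smoothed_likelihoods(observed, levels, fL=0)
with_smoothing = smoothed_likelihoods(observed, levels, fL=1)

print(no_smoothing)    # level 3 has probability 0.0
print(with_smoothing)  # level 3 has probability 1/7
```

With `fL=0` a single unseen level zeroes out the whole product of likelihoods for that class, which is why the grid also tries 1 and 10.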

Results

> cm1
Confusion Matrix and Statistics

          Reference
Prediction FALSE TRUE
     FALSE   273  102
     TRUE      3    4
                                          
               Accuracy : 0.7251          
                 95% CI : (0.6774, 0.7693)
    No Information Rate : 0.7225          
    P-Value [Acc > NIR] : 0.4806          
                                          
                  Kappa : 0.0377          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.03774         
            Specificity : 0.98913         
         Pos Pred Value : 0.57143         
         Neg Pred Value : 0.72800         
             Prevalence : 0.27749         
         Detection Rate : 0.01047         
   Detection Prevalence : 0.01832         
      Balanced Accuracy : 0.51343         
                                          
       'Positive' Class : TRUE  
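To make the statistics above easier to read, the following sketch recomputes them by hand from the four cells of the confusion matrix. The variable names are my own; the counts are taken directly from the output above.

```python
# Counts from the confusion matrix above (positive class = TRUE).
tp, fn = 4, 102    # actual TRUE:  predicted TRUE / predicted FALSE
fp, tn = 3, 273    # actual FALSE: predicted TRUE / predicted FALSE
n = tp + fn + fp + tn

accuracy = (tp + tn) / n
sensitivity = tp / (tp + fn)          # recall for TRUE
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)                  # precision for TRUE
npv = tn / (tn + fn)
nir = max(tp + fn, tn + fp) / n       # no-information rate (majority class share)

# Cohen's kappa: agreement beyond what the marginal totals alone would give.
expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
kappa = (accuracy - expected) / (1 - expected)

print(round(accuracy, 4), round(nir, 4))                 # 0.7251 0.7225
print(round(sensitivity, 4), round(specificity, 4))      # 0.0377 0.9891
print(round(ppv, 4), round(npv, 4), round(kappa, 4))     # 0.5714 0.728 0.0377
```

The accuracy is barely above the no-information rate, and the kappa near zero says the same thing: the model almost always predicts FALSE.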

Actual values

> table(classtesting)
classtesting
FALSE  TRUE 
  276   106 

Deep Neural Networks with TensorFlow in Python

Three-layer neural network. The activation function is sigmoid (logistic).

Data structure
For a neural network to train,
every feature and the class must be normalized to the range 0 to 1 so that the cost function converges.
Otherwise it diverges.
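A minimal sketch of column-wise min-max scaling to the range 0 to 1 is shown below. The function name `min_max_normalize` is an assumption for this example, and the toy array uses only the first three columns of three rows shown earlier, so the scaled values will not match the normalized table that follows (which was scaled over the full data).

```python
import numpy as np

def min_max_normalize(x):
    """Scale each column of x to [0, 1]; constant columns map to 0."""
    x = np.asarray(x, dtype=float)
    col_min = x.min(axis=0)
    col_range = x.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0   # avoid division by zero for constant columns
    return (x - col_min) / col_range

raw = np.array([[1, 74, 18],
                [2, 10, 18],
                [1, 56, 19]])

normalized = min_max_normalize(raw)
print(normalized)
```

In practice the minimum and range would usually be computed on the training rows and then reused for the testing rows, so that both are on the same scale.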

> head(ikheeTrainingDf_norm)
1   0.0000 0.56153846   0.7391304             0.5                  1         1        0               0.0  1
2   0.0625 0.06923077   0.7391304             0.5                  1         1        1               0.2  1
3   0.0000 0.42307692   0.7826087             0.5                  1         1        0               0.0  1
4   0.0000 0.63846154   0.7826087             0.5                  0         0        0               0.6  1
5   0.0000 0.29230769   0.7826087             0.5                  0         1        0               0.6  0
6   0.0000 0.42307692   0.7826087             0.5                  0         1        0               0.6  1

These data are exported to a txt file
and then processed again in Python.
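The sketch below shows how a whitespace-separated export can be read back and split into features and class. It assumes the exported file has no header and no row-index column, only the 8 features followed by the class; that layout is my assumption, and the two rows are copied from the normalized table above for illustration.

```python
import io
import numpy as np

# Two illustrative rows in the assumed layout: 8 features followed by the class.
text = """0.0000 0.56153846 0.7391304 0.5 1 1 0 0.0 1
0.0625 0.06923077 0.7391304 0.5 1 1 1 0.2 1
"""

# unpack=True transposes the result, so xy has one row per file column.
xy = np.loadtxt(io.StringIO(text), unpack=True)

x_data = np.transpose(xy[0:-1])                 # (n_rows, 8) feature matrix
y_data = np.reshape(xy[-1], (len(x_data), 1))   # (n_rows, 1) class column

print(xy.shape, x_data.shape, y_data.shape)     # (9, 2) (2, 8) (2, 1)
```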

import tensorflow as tf
import numpy as np
from sklearn.metrics import precision_score, confusion_matrix
from sklearn.metrics import classification_report
import pandas as pd

# three layers neural networks. Activation function is Sigmoid (logistic)

xyTraining = np.loadtxt('ikheeTrainingNorm.txt', unpack=True)
xyTesting = np.loadtxt('ikheeTestingNorm.txt', unpack=True)

x_data_training = np.transpose(xyTraining[0:-1])
y_data_training = np.reshape(xyTraining[-1], (len(x_data_training), 1))

x_data_testing = np.transpose(xyTesting[0:-1])
y_data_testing = np.reshape(xyTesting[-1], (len(x_data_testing), 1))

X = tf.placeholder(tf.float32, name='X-input')
Y = tf.placeholder(tf.float32, name='Y-input')

W1 = tf.Variable(tf.random_uniform([8, 16], -1.0, 1.0), name='Weight1')
W2 = tf.Variable(tf.random_uniform([16, 8], -1.0, 1.0), name='Weight2')
W3 = tf.Variable(tf.random_uniform([8, 1], -1.0, 1.0), name='Weight3')

b1 = tf.Variable(tf.zeros([16]), name="Bias1")
b2 = tf.Variable(tf.zeros([8]), name="Bias2")
b3 = tf.Variable(tf.zeros([1]), name="Bias3")


# Our hypothesis
with tf.name_scope("layer2") as scope:
    L2 = tf.sigmoid(tf.matmul(X, W1) + b1)

with tf.name_scope("layer3") as scope:
    L3 = tf.sigmoid(tf.matmul(L2, W2) + b2)

with tf.name_scope("layer4") as scope:
    hypothesis = tf.sigmoid(tf.matmul(L3, W3) + b3)

# Cost function
with tf.name_scope("cost") as scope:
    cost = -tf.reduce_mean(Y*tf.log(hypothesis) + (1-Y)*tf.log(1-hypothesis))
    cost_summ = tf.scalar_summary("cost", cost)

# Minimize
with tf.name_scope("train") as scope:
    a = tf.Variable(0.01) # Learning rate, alpha
    optimizer = tf.train.GradientDescentOptimizer(a)
    train = optimizer.minimize(cost)

# Add histogram
w1_hist = tf.histogram_summary("weights1", W1)
w2_hist = tf.histogram_summary("weights2", W2)

b1_hist = tf.histogram_summary("biases1", b1)
b2_hist = tf.histogram_summary("biases2", b2)

y_hist = tf.histogram_summary("y", Y)


# Before starting, initialize the variables.
# We will `run` this first.
init = tf.initialize_all_variables()


# Launch the graph,
with tf.Session() as sess:
    # tensorboard --logdir=./logs/xor_logs
    merged = tf.merge_all_summaries()
    writer = tf.train.SummaryWriter("./logs/xor_logs", sess.graph_def)

    sess.run(init)
    # Run full-batch gradient descent on the training set.
    for step in xrange(2000):
        sess.run(train, feed_dict={X:x_data_training, Y:y_data_training})
        if step % 200 == 0:
            summary = sess.run(merged, feed_dict={X:x_data_training, Y:y_data_training})
            writer.add_summary(summary, step)
            #print step, sess.run(cost, feed_dict={X:x_data, Y:y_data}), sess.run(W1), sess.run(W2)
            print step, sess.run(cost, feed_dict={X:x_data_training, Y:y_data_training})

    # Test model
    correct_prediction = tf.equal(tf.floor(hypothesis+0.5), Y)
    # Calculate accuracy
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))

    y_data_pred = sess.run(tf.floor(hypothesis + 0.5),
                           feed_dict={X: x_data_testing, Y: y_data_testing})

    print sess.run([hypothesis, tf.floor(hypothesis+0.5), correct_prediction, accuracy], feed_dict={X:x_data_testing, Y:y_data_testing})
    print "Accuracy:", accuracy.eval({X:x_data_testing, Y:y_data_testing})

#   print confusion_matrix(y_data_testing[:,0], y_data_pred[:,0], labels=[0, 1])
    pd_y_true = pd.Series(y_data_testing[:, 0])
    pd_x_pred = pd.Series(y_data_pred[:, 0])
    print pd.crosstab(pd_y_true, pd_x_pred, rownames=['True'], colnames=['Predicted'], margins=True)

    target_names = ['false', 'true']
    print(classification_report(y_data_testing[:, 0], y_data_pred[:, 0], target_names=target_names))
    print 'Precision', precision_score(y_data_testing[:, 0], y_data_pred[:, 0], average='binary',pos_label=1)
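To make explicit what the graph above computes, here is a NumPy sketch of the same forward pass (8 inputs, 16 and 8 sigmoid units, 1 sigmoid output), the cross-entropy cost, and the 0.5 threshold. It is only an illustration with random weights and made-up inputs; it does not train anything and is not a replacement for the TensorFlow code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Apply sigmoid(a @ W + b) layer by layer and return the final activations."""
    a = x
    for w, b in zip(weights, biases):
        a = sigmoid(a @ w + b)
    return a

def cross_entropy(y, h):
    """Mean binary cross-entropy, the same formula as the cost node above."""
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

rng = np.random.default_rng(0)
sizes = [8, 16, 8, 1]   # layer widths used in the graph above
weights = [rng.uniform(-1.0, 1.0, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

x = rng.uniform(0.0, 1.0, (5, 8))   # 5 made-up rows, already in [0, 1]
y = np.array([[0.0], [1.0], [0.0], [0.0], [1.0]])

h = forward(x, weights, biases)
pred = np.floor(h + 0.5)            # same threshold as tf.floor(hypothesis + 0.5)
print(h.shape, cross_entropy(y, h))
```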

Results

# Iteration & Cost
0 0.594439
200 0.59129
400 0.591194
600 0.591135
800 0.591078
1000 0.59102
1200 0.590963
1400 0.590907
1600 0.590852
1800 0.590796

Accuracy: 0.722513

# Confusion Matrix
Predicted  0.0  All
0.0        276  276
1.0        106  106
All        382  382


# Precision & Recall
             precision    recall  f1-score   support
      false       0.72      1.00      0.84       276
       true       0.00      0.00      0.00       106

avg / total       0.52      0.72      0.61       382
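One way to read the cost values above: a model that ignores the features and outputs the same probability for every row can do no better than the entropy of the class prior. The sketch below computes that value from the training-set class counts implied by the figures in this post (full data minus the test set); treating those counts as exact is my assumption.

```python
import math

# Training-set class counts implied by the figures above:
# full data (922 FALSE, 354 TRUE) minus the test set (276 FALSE, 106 TRUE).
n_false = 922 - 276   # 646
n_true = 354 - 106    # 248
p = n_true / (n_false + n_true)

# Cross-entropy of a model that outputs the constant probability p for every row.
prior_entropy = -(p * math.log(p) + (1 - p) * math.log(1 - p))

print(round(p, 4), round(prior_entropy, 4))  # 0.2774 0.5905
```

The training cost levels off near 0.5908, just above this value, which is consistent with the network outputting roughly the class prior for every row and therefore never crossing the 0.5 threshold.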

11-layer neural network.
Activation functions are ReLU and sigmoid (logistic).

Code
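For a sense of scale, the sketch below counts the trainable parameters (weights plus biases) of the two fully connected networks in this post: the three-layer sigmoid network and the 11-layer network, which uses ten layers of width 8 followed by a single output unit. The helper `count_parameters` is just an illustration.

```python
def count_parameters(sizes):
    """Number of weights plus biases in a fully connected network with the given layer widths."""
    return sum(m * n + n for m, n in zip(sizes[:-1], sizes[1:]))

three_layer = [8, 16, 8, 1]      # widths of the earlier sigmoid network
eleven_layer = [8] * 11 + [1]    # 8 inputs, ten layers of width 8, one output

print(count_parameters(three_layer))   # 289
print(count_parameters(eleven_layer))  # 729
```

The deeper network has about 2.5 times as many parameters, which is still small compared with the roughly 900 training rows.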

import tensorflow as tf
import numpy as np
from sklearn.metrics import precision_score, confusion_matrix
from sklearn.metrics import classification_report
import pandas as pd

# 11-layer neural network. Activation functions are ReLU (layers 1-10) and sigmoid (output layer)

xyTraining = np.loadtxt('ikheeTrainingNorm.txt', unpack=True)
xyTesting = np.loadtxt('ikheeTestingNorm.txt', unpack=True)

x_data_training = np.transpose(xyTraining[0:-1])
y_data_training = np.reshape(xyTraining[-1], (len(x_data_training), 1))

x_data_testing = np.transpose(xyTesting[0:-1])
y_data_testing = np.reshape(xyTesting[-1], (len(x_data_testing), 1))

X = tf.placeholder(tf.float32, name='X-input')
Y = tf.placeholder(tf.float32, name='Y-input')

W1 = tf.Variable(tf.random_uniform([8, 8], -1.0, 1.0), name='Weight1')
# 9 hidden layers
W2 = tf.Variable(tf.random_uniform([8, 8], -1.0, 1.0), name='Weight2')
W3 = tf.Variable(tf.random_uniform([8, 8], -1.0, 1.0), name='Weight3')
W4 = tf.Variable(tf.random_uniform([8, 8], -1.0, 1.0), name='Weight4')
W5 = tf.Variable(tf.random_uniform([8, 8], -1.0, 1.0), name='Weight5')
W6 = tf.Variable(tf.random_uniform([8, 8], -1.0, 1.0), name='Weight6')
W7 = tf.Variable(tf.random_uniform([8, 8], -1.0, 1.0), name='Weight7')
W8 = tf.Variable(tf.random_uniform([8, 8], -1.0, 1.0), name='Weight8')
W9 = tf.Variable(tf.random_uniform([8, 8], -1.0, 1.0), name='Weight9')
W10 = tf.Variable(tf.random_uniform([8, 8], -1.0, 1.0), name='Weight10')

W11 = tf.Variable(tf.random_uniform([8, 1], -1.0, 1.0), name='Weight11')

b1 = tf.Variable(tf.zeros([8]), name="Bias1")
b2 = tf.Variable(tf.zeros([8]), name="Bias2")
b3 = tf.Variable(tf.zeros([8]), name="Bias3")
b4 = tf.Variable(tf.zeros([8]), name="Bias4")
b5 = tf.Variable(tf.zeros([8]), name="Bias5")
b6 = tf.Variable(tf.zeros([8]), name="Bias6")
b7 = tf.Variable(tf.zeros([8]), name="Bias7")
b8 = tf.Variable(tf.zeros([8]), name="Bias8")
b9 = tf.Variable(tf.zeros([8]), name="Bias9")
b10 = tf.Variable(tf.zeros([8]), name="Bias10")

b11 = tf.Variable(tf.zeros([1]), name="Bias11")


# Our hypothesis
with tf.name_scope("layer1") as scope:
    L1 = tf.nn.relu(tf.matmul(X, W1) + b1)
with tf.name_scope("layer2") as scope:
    L2 = tf.nn.relu(tf.matmul(L1, W2) + b2)
with tf.name_scope("layer3") as scope:
    L3 = tf.nn.relu(tf.matmul(L2, W3) + b3)
with tf.name_scope("layer4") as scope:
    L4 = tf.nn.relu(tf.matmul(L3, W4) + b4)
with tf.name_scope("layer5") as scope:
    L5 = tf.nn.relu(tf.matmul(L4, W5) + b5)
with tf.name_scope("layer6") as scope:
    L6 = tf.nn.relu(tf.matmul(L5, W6) + b6)
with tf.name_scope("layer7") as scope:
    L7 = tf.nn.relu(tf.matmul(L6, W7) + b7)
with tf.name_scope("layer8") as scope:
    L8 = tf.nn.relu(tf.matmul(L7, W8) + b8)
with tf.name_scope("layer9") as scope:
    L9 = tf.nn.relu(tf.matmul(L8, W9) + b9)
with tf.name_scope("layer10") as scope:
    L10 = tf.nn.relu(tf.matmul(L9, W10) + b10)
with tf.name_scope("last") as scope:
    hypothesis = tf.sigmoid(tf.matmul(L10, W11) + b11)


# Cost function
with tf.name_scope("cost") as scope:
    cost = -tf.reduce_mean(Y*tf.log(hypothesis) + (1-Y)*tf.log(1-hypothesis))
    cost_summ = tf.scalar_summary("cost", cost)

# Minimize
with tf.name_scope("train") as scope:
    a = tf.Variable(0.001) # Learning rate, alpha
    optimizer = tf.train.GradientDescentOptimizer(a)
    train = optimizer.minimize(cost)



# Before starting, initialize the variables.
# We will `run` this first.
init = tf.initialize_all_variables()


# Launch the graph,
with tf.Session() as sess:
    # tensorboard --logdir=./logs/xor_logs
    merged = tf.merge_all_summaries()
    writer = tf.train.SummaryWriter("./logs/xor_logs", sess.graph_def)

    sess.run(init)
    # Run full-batch gradient descent on the training set.
    for step in xrange(50000):
        sess.run(train, feed_dict={X:x_data_training, Y:y_data_training})
        if step % 2000 == 0:
            summary = sess.run(merged, feed_dict={X:x_data_training, Y:y_data_training})
            writer.add_summary(summary, step)
            #print step, sess.run(cost, feed_dict={X:x_data, Y:y_data}), sess.run(W1), sess.run(W2)
            print step, sess.run(cost, feed_dict={X:x_data_training, Y:y_data_training})

    # Test model
    correct_prediction = tf.equal(tf.floor(hypothesis+0.5), Y)
    # Calculate accuracy
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))

    y_data_pred = sess.run(tf.floor(hypothesis + 0.5),
                           feed_dict={X: x_data_testing, Y: y_data_testing})

    print sess.run([hypothesis, tf.floor(hypothesis+0.5), correct_prediction, accuracy], feed_dict={X:x_data_testing, Y:y_data_testing})
    print "Accuracy:", accuracy.eval({X:x_data_testing, Y:y_data_testing})

#   print confusion_matrix(y_data_testing[:,0], y_data_pred[:,0], labels=[0, 1])
    pd_y_true = pd.Series(y_data_testing[:, 0])
    pd_x_pred = pd.Series(y_data_pred[:, 0])
    print pd.crosstab(pd_y_true, pd_x_pred, rownames=['True'], colnames=['Predicted'], margins=True)

    target_names = ['false', 'true']
    print(classification_report(y_data_testing[:, 0], y_data_pred[:, 0], target_names=target_names))
    print 'Precision', precision_score(y_data_testing[:, 0], y_data_pred[:, 0], average='binary',pos_label=1)

Results

/root/tensorflow/bin/python /root/PycharmProjects/TensorFlowTest/PASSwithDNNLeRu9Hidden.py
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 8
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 8
0 0.770699
2000 0.589376
4000 0.580235
6000 0.578699
8000 0.577574
10000 0.576372
12000 0.575388
14000 0.574309
16000 0.572363
18000 0.570983
20000 0.569931
22000 0.568943
24000 0.567569
26000 0.565458
28000 0.564114
30000 0.562682
32000 0.561554
34000 0.56046
36000 0.559264
38000 0.558028
40000 0.556391
42000 0.555027
44000 0.553637
46000 0.55207
48000 0.550296


Accuracy: 0.727749
Predicted  0.0  1.0  All
True                    
0.0        276    0  276
1.0        104    2  106
All        380    2  382
             precision    recall  f1-score   support

      false       0.73      1.00      0.84       276
       true       1.00      0.02      0.04       106

avg / total       0.80      0.73      0.62       382

Precision 1.0

Conclusion

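Because the labels themselves (true/false) are inaccurate, the results follow the "garbage in, garbage out" principle and the methods barely differ: test accuracy is 0.7251 for Naive Bayes, 0.7225 for the three-layer network, and 0.7277 for the 11-layer network, against a no-information rate of 0.7225.

However, the precision of the Deep Neural Network is very high (1.0 for the 11-layer network, although on only 2 TRUE predictions, versus 0.57 for Naive Bayes),
so it can be considered useful in systems where a false positive prediction is critical.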
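The precision figures can be checked directly from the confusion matrices reported earlier. The sketch below lists, for each model, how many TRUE predictions it made on the 382 test rows and how many of them were correct; the dictionary layout is my own.

```python
# (true positives, false positives) on the 382 test rows, from the tables above.
models = {
    "Naive Bayes": (4, 3),
    "3-layer sigmoid network": (0, 0),
    "11-layer ReLU network": (2, 0),
}

precisions = {}
for name, (tp, fp) in models.items():
    predicted_positive = tp + fp
    # Precision is undefined when the model never predicts TRUE.
    precisions[name] = tp / predicted_positive if predicted_positive else None
    print(name, f"{tp}/{predicted_positive}", precisions[name])
```

The three-layer network never predicts TRUE, so its precision is undefined (scikit-learn reports it as 0.00), and the 11-layer network's precision of 1.0 rests on two predictions.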

The Deep Neural Network could probably be improved further
through Dropout, an Ada optimizer, better initial weight settings, and similar techniques.
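As a rough idea of what two of these techniques look like, here is a NumPy sketch of a scaled weight initialization and of dropout. The post does not say which initializer was intended, so Glorot (Xavier) uniform initialization is my own choice here, and the function names and `keep_prob` value are assumptions; the optimizer is not covered.

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot_uniform(fan_in, fan_out):
    """Glorot/Xavier uniform init: U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, (fan_in, fan_out))

def dropout(activations, keep_prob):
    """Inverted dropout: zero each unit with probability 1 - keep_prob and rescale the rest."""
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

w = glorot_uniform(8, 8)                       # replaces random_uniform(-1, 1) for one 8x8 layer
inputs = rng.uniform(0.0, 1.0, (4, 8))         # 4 made-up normalized rows
hidden = np.maximum(0.0, inputs @ w)           # ReLU layer
dropped = dropout(hidden, keep_prob=0.8)       # applied during training only

print(w.shape, dropped.shape)
```

Compared with drawing weights from U(-1, 1), the scaled range keeps activations from growing or shrinking as quickly across ten stacked layers.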


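  1. 지니 2017.01.03 15:26

    Hello.
    The example you posted has been a great help in my studies.

    ikheeTrainingNorm.txt
    ikheeTestingNorm.txt
    Would it be possible to share these two files as well?

    I am curious what kind of data went in.
    It would help even more if the data labels and format were described.
    Thank you!

    • JAYNUX 2017.01.04 11:04

      Because this is personal data, I am afraid I cannot describe the contents in detail.

      To show what the format looks like,
      I have extracted part of the data and shared it on Google Drive.

      There is a link in the post above; you can download it from there.

      If you normalize your own data in the same way before applying it, I expect the neural net will work for your application.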
