Random Forest

Features
Bootstrap samples
At each split, bootstrap the variables
Grow multiple trees and vote
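The three steps above can be sketched with the randomForest package directly (a minimal sketch; the `ntree` and `mtry` values here are just illustrative defaults, not from the lecture):

```r
library(randomForest)
data(iris)

# Each tree is grown on a bootstrap sample of the rows; at every split,
# mtry variables are sampled as candidates; prediction is a majority vote.
fit <- randomForest(Species ~ ., data = iris,
                    ntree = 500,  # grow multiple trees
                    mtry  = 2)    # candidate variables sampled per split
fit
```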

Pros:
Accuracy
This is why it is said to be widely used in Kaggle competitions.

Cons:
Slow training time
Difficult to interpret
Overfitting
It builds very large trees, and each tree depends on its own bootstrap sample.

Because each tree is slightly different, each one produces a slightly different result.

All the results are then averaged, which yields a probability estimate for each class.
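With the randomForest package, those averaged votes can be read off directly as class probabilities (a small self-contained sketch, not from the lecture):

```r
library(randomForest)
data(iris)

fit <- randomForest(Species ~ ., data = iris)
# type = "prob" returns the vote proportions across all trees,
# one row per observation, one column per class
predict(fit, newdata = iris[1:3, ], type = "prob")
```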

Let's see how this works with the example below.

Iris data

data(iris); library(ggplot2); library(caret)
## Loading required package: lattice
inTrain <- createDataPartition(y=iris$Species,
                              p=0.7, list=FALSE)
training <- iris[inTrain,]
testing <- iris[-inTrain,]

Run Random Forest. The class to predict is Species, and all remaining variables are used as features.

modFit <- train(Species~ .,data=training,method="rf",prox=TRUE)
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
modFit
## Random Forest 
## 
## 105 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 105, 105, 105, 105, 105, 105, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD  
##   2     0.9517675  0.9265018  0.02291581   0.03478529
##   3     0.9528615  0.9280510  0.02219341   0.03404748
##   4     0.9519080  0.9265991  0.02369344   0.03623696
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 3.

We can see that a total of 25 bootstrap resamples were performed.
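The 25 bootstrap reps are caret's default resampling scheme. If you want a different scheme, it can be passed to `train` via `trainControl` (a hedged sketch; `number = 10` is just an example value):

```r
library(caret)

# e.g. 10 bootstrap reps instead of the default 25
ctrl <- trainControl(method = "boot", number = 10)
modFit10 <- train(Species ~ ., data = training,
                  method = "rf", trControl = ctrl)
```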

Getting a single tree

A single tree looks like the following. Setting k=2 means we look at the second tree.

getTree(modFit$finalModel,k=2)
##   left daughter right daughter split var split point status prediction
## 1             2              3         4        0.80      1          0
## 2             0              0         0        0.00     -1          1
## 3             4              5         3        5.05      1          0
## 4             6              7         4        1.75      1          0
## 5             0              0         0        0.00     -1          3
## 6             0              0         0        0.00     -1          2
## 7             8              9         3        4.85      1          0
## 8             0              0         0        0.00     -1          2
## 9             0              0         0        0.00     -1          3

Each row represents a particular split.
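The numeric `split var` column is hard to read on its own; `getTree` can label the split variables by name, which makes each row (one node of the tree) easier to interpret:

```r
# labelVar = TRUE replaces variable indices with variable names
# and prediction codes with class labels
getTree(modFit$finalModel, k = 2, labelVar = TRUE)
```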

Class “centers”

irisP <- classCenter(training[,c(3,4)], training$Species, modFit$finalModel$prox)
irisP <- as.data.frame(irisP); irisP$Species <- rownames(irisP)
p <- qplot(Petal.Width, Petal.Length, col=Species,data=training)
p + geom_point(aes(x=Petal.Width,y=Petal.Length,col=Species),size=5,shape=4,data=irisP)

Predicting new values

We missed just two values with the random forest model. Overall, it is highly accurate in its predictions.

pred <- predict(modFit,testing); testing$predRight <- pred==testing$Species
table(pred,testing$Species)
##             
## pred         setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         14         2
##   virginica       0          1        13

We can then look and see which two we missed.

qplot(Petal.Width,Petal.Length,colour=predRight,data=testing,main="newdata Predictions")
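Besides plotting, the misclassified observations can also be listed directly using the `predRight` column computed above:

```r
# rows of the test set where the prediction was wrong
testing[!testing$predRight, ]
```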

After building a model, we should check where our predictions do well and where they do poorly.

Summary

The Random Forests algorithm is a top-performing method, along with boosting.

Random forests are often difficult to interpret because of the multiple trees we are fitting, but they can be very accurate for a wide range of problems.

