Week 02: Predicting with Regression


  • Caret package
  • Data slicing
  • Training options
  • Plotting predictions
  • Basic preprocessing
  • Covariate creation
  • Preprocessing with principal components analysis
  • Predicting with Regression
  • Predicting with Regression Multiple Covariates



Predicting with Regression


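When the relationship between predictor and outcome is linear, regression is easy to implement and easy to interpret.

When the relationship is nonlinear, however, its accuracy drops sharply.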
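
As a rough illustration of that trade-off, the sketch below fits a straight line to two simulated data sets, one linear and one curved, and compares R-squared. The data, seed, and variable names are made up for this example and are not part of the course material.

```r
# Simulated data (hypothetical, for illustration only)
set.seed(1)
x <- seq(0, 10, length.out = 200)
y_linear <- 2 + 0.5 * x + rnorm(200, sd = 0.3)  # truly linear signal
y_curved <- sin(x) + rnorm(200, sd = 0.3)       # nonlinear signal

# Fit the same kind of model to both outcomes
fit_linear <- lm(y_linear ~ x)
fit_curved <- lm(y_curved ~ x)

# Share of variance explained by the straight line in each case
r2_linear <- summary(fit_linear)$r.squared
r2_curved <- summary(fit_curved)$r.squared
c(linear = r2_linear, curved = r2_curved)
```

The straight line explains most of the variance in the linear case and very little in the curved case, which is the loss of accuracy described above.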


Example: old faithful eruptions

# predicting with regression: split Old Faithful data into training and test sets
library(caret);data(faithful); set.seed(333)
inTrain <- createDataPartition(y=faithful$waiting,
                               p=0.5, list=FALSE)
trainFaith <- faithful[inTrain,]; testFaith <- faithful[-inTrain,]
head(trainFaith)

  eruptions waiting
1     3.600      79
3     3.333      74
5     4.533      85
6     2.883      55
7     4.700      88
8     3.600      85
# showing graph
plot(trainFaith$waiting,trainFaith$eruptions,pch=19,col="blue",xlab="Waiting",ylab="Duration")

The data look roughly linear, so linear regression is a reasonable choice.


# fit the linear model on the training set
lm1 <- lm(eruptions ~ waiting,data=trainFaith)
summary(lm1)
Call:
lm(formula = eruptions ~ waiting, data = trainFaith)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.26990 -0.34789  0.03979  0.36589  1.05020 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.792739   0.227869  -7.867 1.04e-12 ***
waiting      0.073901   0.003148  23.474  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.495 on 135 degrees of freedom
Multiple R-squared:  0.8032,	Adjusted R-squared:  0.8018 
F-statistic:   551 on 1 and 135 DF,  p-value: < 2.2e-16
# model fit
plot(trainFaith$waiting,trainFaith$eruptions,pch=19,col="blue",xlab="Waiting",ylab="Duration")
lines(trainFaith$waiting,lm1$fitted,lwd=3)

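The plot overlays the fitted line on the training data, so the model can be checked visually.

Predicting a new value then works as follows, either by hand from the coefficients or with predict().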

# predict a new value
coef(lm1)[1] + coef(lm1)[2]*80
(Intercept) 
   4.119307 
newdata <- data.frame(waiting=80)
predict(lm1,newdata)
       1 
4.119307 
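
The agreement between the hand calculation and predict() can be checked for several waiting times at once. The sketch below is only an illustration: it fits on the full faithful data set rather than the training split, and the waiting times in new_waits are made-up values.

```r
# Sketch: manual prediction vs. predict() for several waiting times
data(faithful)                                   # built-in Old Faithful data
fit <- lm(eruptions ~ waiting, data = faithful)  # assumption: fit on all rows

new_waits <- c(50, 65, 80, 95)                   # hypothetical waiting times

# Manual: intercept + slope * waiting
manual <- coef(fit)[["(Intercept)"]] + coef(fit)[["waiting"]] * new_waits

# predict() needs a data frame whose column name matches the model term
predicted <- predict(fit, newdata = data.frame(waiting = new_waits))

# TRUE when both approaches give the same numbers
all.equal(manual, unname(predicted))
```

Using predict() is usually preferable because it keeps working when the model gains more terms, whereas the hand formula has to be rewritten.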

Plotting the predictions against the training set and the test set side by side gives the following.

# plot predictions - training and test
par(mfrow=c(1,2))
plot(trainFaith$waiting,trainFaith$eruptions,pch=19,col="blue",xlab="Waiting",ylab="Duration")
lines(trainFaith$waiting,predict(lm1),lwd=3)
plot(testFaith$waiting,testFaith$eruptions,pch=19,col="blue",xlab="Waiting",ylab="Duration")
lines(testFaith$waiting,predict(lm1,newdata=testFaith),lwd=3)

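Next, compute the prediction error on the training set and on the test set. Note that the code below takes the square root of the summed squared errors and does not divide by the number of observations, so the values are larger than a true RMSE, which would use mean() in place of sum().

> # Root of the summed squared errors on training (sqrt(SSE), not divided by n)
> sqrt(sum((lm1$fitted-trainFaith$eruptions)^2))
[1] 5.75186
> # Root of the summed squared errors on test (sqrt(SSE), not divided by n)
> sqrt(sum((predict(lm1,newdata=testFaith)-testFaith$eruptions)^2))
[1] 5.838559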
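
For comparison, a conventional RMSE averages the squared errors before taking the square root. The sketch below is a minimal version under stated assumptions: rmse is a hypothetical helper, not a function from the post, and a plain sample() split stands in for createDataPartition so that only base R is needed. The resulting numbers will therefore differ from the ones shown above.

```r
# Hypothetical helper: root mean squared error
rmse <- function(predicted, observed) {
  sqrt(mean((predicted - observed)^2))
}

data(faithful)
set.seed(333)
# Assumption: simple random split instead of caret's createDataPartition()
train_idx   <- sample(seq_len(nrow(faithful)), size = floor(nrow(faithful) / 2))
train_faith <- faithful[train_idx, ]
test_faith  <- faithful[-train_idx, ]

fit <- lm(eruptions ~ waiting, data = train_faith)

train_rmse <- rmse(fitted(fit), train_faith$eruptions)
test_rmse  <- rmse(predict(fit, newdata = test_faith), test_faith$eruptions)

# sqrt(SSE) is RMSE scaled by sqrt(n), so it grows with the sample size
root_sse <- sqrt(sum((fitted(fit) - train_faith$eruptions)^2))
all.equal(root_sse, train_rmse * sqrt(nrow(train_faith)))
```

Because RMSE is on the scale of the outcome (minutes of eruption) and does not depend on the number of rows, it is easier to compare between training and test sets of different sizes.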


The same procedure can be carried out very easily with caret.

> #same process with caret
> modFit <- train(eruptions ~ waiting,data=trainFaith,method="lm")
> summary(modFit$finalModel)

Call:
lm(formula = .outcome ~ ., data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.26990 -0.34789  0.03979  0.36589  1.05020 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.792739   0.227869  -7.867 1.04e-12 ***
waiting      0.073901   0.003148  23.474  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.495 on 135 degrees of freedom
Multiple R-squared:  0.8032,	Adjusted R-squared:  0.8018 
F-statistic:   551 on 1 and 135 DF,  p-value: < 2.2e-16
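
A caret model can also be used directly for prediction on the test set. The sketch below assumes the caret package is installed; it repeats the split from the example and checks that train() with method="lm" gives the same test predictions as a plain lm() fit. The variable names are chosen for this illustration.

```r
library(caret)   # assumption: caret is installed
data(faithful)
set.seed(333)

# Same kind of split as in the example above
in_train    <- createDataPartition(y = faithful$waiting, p = 0.5, list = FALSE)
train_faith <- faithful[in_train, ]
test_faith  <- faithful[-in_train, ]

# Fit once through caret and once with lm() directly
mod_fit <- train(eruptions ~ waiting, data = train_faith, method = "lm")
lm_fit  <- lm(eruptions ~ waiting, data = train_faith)

# Predict on the held-out test set with both models
caret_pred <- predict(mod_fit, newdata = test_faith)
lm_pred    <- predict(lm_fit, newdata = test_faith)

# TRUE when both models produce the same predictions
all.equal(unname(caret_pred), unname(lm_pred))
```

The benefit of going through train() is that switching to another model type mostly means changing the method argument, while the prediction code stays the same.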




