Week 02: Predicting with Regression

• Caret package
• Data slicing
• Training options
• Plotting predictions
• Basic preprocessing
• Covariate creation
• Preprocessing with principal components analysis
• Predicting with Regression
• Predicting with Regression Multiple Covariates

Predicting with Regression

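When the relationship is linear, a regression model is easy to implement and easy to interpret.

When the relationship is nonlinear, however, its accuracy drops sharply.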
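As a rough illustration of this trade-off, the sketch below fits a straight line to two simulated data sets, one linear and one sinusoidal, and compares how much variance the line explains in each. The data, the noise level, and all variable names are assumptions made up for this example; they are not part of the course material.

```r
# Simulated data (illustrative assumption, not from the course)
set.seed(1)
x <- seq(0, 10, length.out = 200)
noise <- rnorm(length(x), sd = 0.3)

y_linear    <- 2 + 0.5 * x + noise   # outcome that is linear in x
y_nonlinear <- sin(x) + noise        # outcome that is nonlinear in x

# Fit the same straight-line model to both outcomes
fit_linear    <- lm(y_linear ~ x)
fit_nonlinear <- lm(y_nonlinear ~ x)

# Share of variance explained by the straight line in each case
r2_linear    <- summary(fit_linear)$r.squared
r2_nonlinear <- summary(fit_nonlinear)$r.squared
print(c(linear = r2_linear, nonlinear = r2_nonlinear))
```

The straight line should explain most of the variance in the linear case and very little in the sinusoidal one.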

Example: Old Faithful eruptions

# predicting with regression
library(caret); data(faithful); set.seed(333)
inTrain <- createDataPartition(y=faithful$waiting, p=0.5, list=FALSE)
trainFaith <- faithful[inTrain,]; testFaith <- faithful[-inTrain,]
head(trainFaith)

  eruptions waiting
1     3.600      79
3     3.333      74
5     4.533      85
6     2.883      55
7     4.700      88
8     3.600      85

# showing graph
plot(trainFaith$waiting,trainFaith$eruptions,pch=19,col="blue",xlab="Waiting",ylab="Duration")

The data look roughly linear, so linear regression is a reasonable choice.

# making model
lm1 <- lm(eruptions ~ waiting,data=trainFaith)
summary(lm1)

Call:
lm(formula = eruptions ~ waiting, data = trainFaith)

Residuals:
     Min       1Q   Median       3Q      Max
-1.26990 -0.34789  0.03979  0.36589  1.05020

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.792739   0.227869  -7.867 1.04e-12 ***
waiting      0.073901   0.003148  23.474  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.495 on 135 degrees of freedom
Multiple R-squared:  0.8032,	Adjusted R-squared:  0.8018
F-statistic:   551 on 1 and 135 DF,  p-value: < 2.2e-16

# model fit
plot(trainFaith$waiting,trainFaith$eruptions,pch=19,col="blue",xlab="Waiting",ylab="Duration")
lines(trainFaith$waiting,lm1$fitted,lwd=3)

The plot shows the fitted model over the training data. Predicting a new value gives the following.

# predict a new value
coef(lm1)[1] + coef(lm1)[2]*80

(Intercept)
   4.119307

newdata <- data.frame(waiting=80)
predict(lm1,newdata)

       1
4.119307

Plotting the predictions for the training set and the test set side by side gives the following.

# plot predictions - training and test
par(mfrow=c(1,2))
plot(trainFaith$waiting,trainFaith$eruptions,pch=19,col="blue",xlab="Waiting",ylab="Duration")
lines(trainFaith$waiting,predict(lm1),lwd=3)
plot(testFaith$waiting,testFaith$eruptions,pch=19,col="blue",xlab="Waiting",ylab="Duration")
lines(testFaith$waiting,predict(lm1,newdata=testFaith),lwd=3)

The model's error is computed below, once on the training set and once on the test set. The comments call it RMSE, but the code takes the square root of the summed squared errors without dividing by the number of observations, so the values are larger than a per-observation RMSE.

> # Calculate RMSE on training
> sqrt(sum((lm1$fitted-trainFaith$eruptions)^2))
[1] 5.75186
> # Calculate RMSE on test
> sqrt(sum((predict(lm1,newdata=testFaith)-testFaith$eruptions)^2))
[1] 5.838559
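To report the error on a per-observation scale, the squared errors can be averaged before taking the root. The following is a minimal sketch, assuming a hypothetical helper named `rmse` (not part of the original post) and a simple odd/even row split in place of `createDataPartition`, so that it runs with base R only. Because the split differs, its numbers will not match the ones above.

```r
# Hypothetical helper (not from the original post): root mean squared error
rmse <- function(predicted, observed) {
  sqrt(mean((predicted - observed)^2))
}

# Assumption: a simple odd/even row split stands in for createDataPartition
data(faithful)
idx <- seq(1, nrow(faithful), by = 2)
train_set <- faithful[idx, ]
test_set  <- faithful[-idx, ]

# Same model form as in the post: eruption duration as a function of waiting time
fit <- lm(eruptions ~ waiting, data = train_set)

rmse_train <- rmse(fitted(fit), train_set$eruptions)
rmse_test  <- rmse(predict(fit, newdata = test_set), test_set$eruptions)
print(c(train = rmse_train, test = rmse_test))
```

Multiplying either value by the square root of the corresponding sample size recovers the sum-based figure for that split.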

The same fit can be done very easily with caret.

> # same process with caret
> modFit <- train(eruptions ~ waiting,data=trainFaith,method="lm")
> summary(modFit$finalModel)

Call:
lm(formula = .outcome ~ ., data = dat)

Residuals:
     Min       1Q   Median       3Q      Max
-1.26990 -0.34789  0.03979  0.36589  1.05020

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.792739   0.227869  -7.867 1.04e-12 ***
waiting      0.073901   0.003148  23.474  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.495 on 135 degrees of freedom
Multiple R-squared:  0.8032,	Adjusted R-squared:  0.8018
F-statistic:   551 on 1 and 135 DF,  p-value: < 2.2e-16
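Once trained, the caret model can be used for prediction in much the same way as the plain `lm` fit. The sketch below assumes the caret package is installed and uses its `predict` method for `train` objects together with `postResample`, which reports RMSE and R-squared (recent caret versions also report MAE). The variable names `pred` and `perf` are introduced only for this example.

```r
# Requires the caret package (assumed to be installed)
library(caret)
data(faithful); set.seed(333)

# Same 50/50 split and model as in the post
inTrain <- createDataPartition(y = faithful$waiting, p = 0.5, list = FALSE)
trainFaith <- faithful[inTrain, ]
testFaith  <- faithful[-inTrain, ]

modFit <- train(eruptions ~ waiting, data = trainFaith, method = "lm")

# Predict eruption duration for the held-out test set
pred <- predict(modFit, newdata = testFaith)

# Summarize test-set performance; RMSE here is averaged per observation
perf <- postResample(pred = pred, obs = testFaith$eruptions)
print(perf)
```

Here `postResample` averages the squared errors before taking the root, so its RMSE is on a per-observation scale.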
