Week 02: Predicting with Regression
- Caret package
- Data slicing
- Training options
- Plotting predictions
- Basic preprocessing
- Covariate creation
- Preprocessing with principal components analysis
- Predicting with Regression
- Predicting with Regression Multiple Covariates
Predicting with Regression
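When the relationship is linear, regression is easy to implement and easy to interpret.
When the relationship is nonlinear, however, its accuracy drops sharply.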
Example: Old Faithful eruptions
# predicting with regression
library(caret);data(faithful); set.seed(333)
inTrain <- createDataPartition(y=faithful$waiting,
p=0.5, list=FALSE)
trainFaith <- faithful[inTrain,]; testFaith <- faithful[-inTrain,]
head(trainFaith)
eruptions waiting
1 3.600 79
3 3.333 74
5 4.533 85
6 2.883 55
7 4.700 88
8 3.600 85
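For intuition about what the split does, here is a minimal sketch using only base R. It is not what `createDataPartition` does internally: caret stratifies on the outcome (for a numeric `y` it samples within quantile groups), while this sketch simply draws a random half of the rows. The names `idx`, `trainSimple`, and `testSimple` are illustrative and do not appear in the course code.

```r
# Plain random 50/50 split in base R (no stratification, unlike caret)
data(faithful)
set.seed(333)

n <- nrow(faithful)
idx <- sample(seq_len(n), size = floor(0.5 * n))  # row indices for training

trainSimple <- faithful[idx, ]   # rows selected for training
testSimple  <- faithful[-idx, ]  # all remaining rows go to the test set

# The two sets are disjoint and together cover every row
nrow(trainSimple) + nrow(testSimple) == n
length(intersect(rownames(trainSimple), rownames(testSimple))) == 0
```

Because the rows are drawn differently, this split will not reproduce the `trainFaith` rows shown above.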
# showing graph
plot(trainFaith$waiting,trainFaith$eruptions,pch=19,col="blue",xlab="Waiting",ylab="Duration")
The relationship looks roughly linear, so linear regression is a good fit for this data.
# making model
lm1 <- lm(eruptions ~ waiting,data=trainFaith)
summary(lm1)
Call:
lm(formula = eruptions ~ waiting, data = trainFaith)
Residuals:
Min 1Q Median 3Q Max
-1.26990 -0.34789 0.03979 0.36589 1.05020
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.792739 0.227869 -7.867 1.04e-12 ***
waiting 0.073901 0.003148 23.474 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.495 on 135 degrees of freedom
Multiple R-squared: 0.8032, Adjusted R-squared: 0.8018
F-statistic: 551 on 1 and 135 DF, p-value: < 2.2e-16
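The estimates in the coefficient table can be reproduced by hand, which may help show what `lm` computes for a single predictor: the slope is cov(x, y) / var(x) and the intercept is mean(y) - slope * mean(x). The sketch below uses the full `faithful` data rather than `trainFaith` so that it runs without caret, which means its numbers differ slightly from the output above. `slopeHat`, `interceptHat`, and `fitAll` are illustrative names.

```r
# Closed-form least squares for one predictor, checked against lm()
data(faithful)
x <- faithful$waiting
y <- faithful$eruptions

slopeHat <- cov(x, y) / var(x)                # slope = cov(x, y) / var(x)
interceptHat <- mean(y) - slopeHat * mean(x)  # line passes through the means

fitAll <- lm(eruptions ~ waiting, data = faithful)

# Compare the hand-computed values with lm's coefficients
c(interceptHat, slopeHat)
coef(fitAll)
all.equal(unname(coef(fitAll)), c(interceptHat, slopeHat))
```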
# model fit
plot(trainFaith$waiting,trainFaith$eruptions,pch=19,col="blue",xlab="Waiting",ylab="Duration")
lines(trainFaith$waiting,lm1$fitted,lwd=3)
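This overlays the fitted line on the training data so the model can be checked visually.
Predicting a new value with the model gives the following.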
# predict a new value
coef(lm1)[1] + coef(lm1)[2]*80
(Intercept)
4.119307
newdata <- data.frame(waiting=80)
predict(lm1,newdata)
1
4.119307
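`predict` also accepts several new values at once, and it can return an interval around each prediction. A minimal sketch, fitted on the full `faithful` data so that it is self-contained (so the value for a waiting time of 80 will differ slightly from the 4.119307 above); `fitAll`, `newWaits`, and `predInt` are illustrative names.

```r
# Predict several new waiting times at once, with prediction intervals
data(faithful)
fitAll <- lm(eruptions ~ waiting, data = faithful)

newWaits <- data.frame(waiting = c(60, 70, 80))

# Point predictions, one per row of newWaits
predict(fitAll, newdata = newWaits)

# Point predictions plus 95% prediction intervals (columns fit, lwr, upr)
predInt <- predict(fitAll, newdata = newWaits, interval = "prediction")
predInt
```

The column name in the new data frame has to match the predictor name in the formula (`waiting`), otherwise `predict` cannot find the variable.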
The code below plots the fitted line over the training set and the test set side by side.
# plot predictions - training and test
par(mfrow=c(1,2))
plot(trainFaith$waiting,trainFaith$eruptions,pch=19,col="blue",xlab="Waiting",ylab="Duration")
lines(trainFaith$waiting,predict(lm1),lwd=3)
plot(testFaith$waiting,testFaith$eruptions,pch=19,col="blue",xlab="Waiting",ylab="Duration")
lines(testFaith$waiting,predict(lm1,newdata=testFaith),lwd=3)
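The model's error is computed below for the training set and for the test set. Note that the code takes the square root of the summed squared errors rather than of their mean, so the values are sqrt(n) times the usual RMSE.
> # Root of summed squared errors on training (RMSE * sqrt(n))
> sqrt(sum((lm1$fitted-trainFaith$eruptions)^2))
[1] 5.75186
> # Root of summed squared errors on test (RMSE * sqrt(n))
> sqrt(sum((predict(lm1,newdata=testFaith)-testFaith$eruptions)^2))
[1] 5.838559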
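To get RMSE in its usual form, the sum in the commands above would be replaced by a mean, i.e. sqrt(mean((predicted - observed)^2)), which makes the error comparable across data sets of different sizes. A minimal sketch with a hypothetical helper `rmse` (not part of base R), using a plain base R split so that it runs on its own; its numbers will therefore not match the caret-based split above.

```r
# Hypothetical helper: root mean squared error
rmse <- function(predicted, observed) {
  sqrt(mean((predicted - observed)^2))
}

data(faithful)
set.seed(333)
idx <- sample(seq_len(nrow(faithful)), size = floor(0.5 * nrow(faithful)))
trainSet <- faithful[idx, ]
testSet  <- faithful[-idx, ]

fit <- lm(eruptions ~ waiting, data = trainSet)

rmseTrain <- rmse(fitted(fit), trainSet$eruptions)
rmseTest  <- rmse(predict(fit, newdata = testSet), testSet$eruptions)
rmseTrain
rmseTest

# Relation to the sum-based version: dividing by sqrt(n) gives the same value
sumBased <- sqrt(sum((fitted(fit) - trainSet$eruptions)^2))
all.equal(sumBased / sqrt(nrow(trainSet)), rmseTrain)
```

With the mean-based form, the error is in the same units as `eruptions` (minutes) and can be read as a typical prediction error.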
With caret, the same model can be fitted with very little code.
> #same process with caret
> modFit <- train(eruptions ~ waiting,data=trainFaith,method="lm")
> summary(modFit$finalModel)
Call:
lm(formula = .outcome ~ ., data = dat)
Residuals:
Min 1Q Median 3Q Max
-1.26990 -0.34789 0.03979 0.36589 1.05020
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.792739 0.227869 -7.867 1.04e-12 ***
waiting 0.073901 0.003148 23.474 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.495 on 135 degrees of freedom
Multiple R-squared: 0.8032, Adjusted R-squared: 0.8018
F-statistic: 551 on 1 and 135 DF, p-value: < 2.2e-16
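The caret model can be used for prediction in the same way as the `lm` object. The sketch below repeats the split and the fit so that it runs on its own, then predicts on the test set and summarizes the error with caret's `postResample`. `predTest` and `testError` are illustrative names; by default `train` also resamples the training data (bootstrap) to estimate performance, which is why it is slower than a bare `lm` call.

```r
library(caret)
data(faithful)
set.seed(333)

# Same split and model as above
inTrain <- createDataPartition(y = faithful$waiting, p = 0.5, list = FALSE)
trainFaith <- faithful[inTrain, ]
testFaith  <- faithful[-inTrain, ]
modFit <- train(eruptions ~ waiting, data = trainFaith, method = "lm")

# Predict eruption duration for the held-out test rows
predTest <- predict(modFit, newdata = testFaith)

# Mean-based RMSE and R-squared on the test set
# (recent caret versions also report MAE)
testError <- postResample(pred = predTest, obs = testFaith$eruptions)
testError
```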