Week 02: Training options, Plotting predictions
- Caret package
- Data slicing
- Training options
- Plotting predictions
- Basic preprocessing
- Covariate creation
- Preprocessing with principal components analysis
- Predicting with Regression
- Predicting with Regression Multiple Covariates
Training options
Previously, we set the training options like this:
inTrain <- createDataPartition(y=spam$type,
                               p=0.75, list=FALSE)
training <- spam[inTrain,]
testing <- spam[-inTrain,]
modelFit <- train(type ~ ., data=training, method="glm")
However, many more training options are available:
> args(train.default)
function (x, y, method = "rf", preProcess = NULL, ..., weights = NULL,
metric = ifelse(is.factor(y), "Accuracy", "RMSE"), maximize = ifelse(metric %in%
c("RMSE", "logLoss"), FALSE, TRUE), trControl = trainControl(),
tuneGrid = NULL, tuneLength = 3)
NULL
The preProcess argument specifies how the predictors should be preprocessed.
If the data sample is unbalanced, you can adjust the weights argument.
The line metric = ifelse(is.factor(y), "Accuracy", "RMSE") means that for categorical data, training selects the model that maximizes accuracy,
while for continuous data it selects the model that minimizes RMSE (note that maximize is FALSE when the metric is "RMSE" or "logLoss").
trainControl can also be configured.
■ Metric options
Continuous outcomes:
- RMSE = Root mean squared error
- RSquared = $R^2$ from regression models
Categorical outcomes:
- Accuracy = Fraction correct
- Kappa = A measure of concordance
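As a minimal sketch (assuming the caret and kernlab packages are installed), you could ask train() to select the final model by Kappa instead of the default Accuracy for a categorical outcome:

```r
library(caret)
library(kernlab)   # provides the spam data set

data(spam)
inTrain  <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]

# Select the final model by Kappa (concordance) rather than Accuracy
modelFit <- train(type ~ ., data = training,
                  method = "glm", metric = "Kappa")
modelFit$metric   # "Kappa"
```

For continuous outcomes, the analogous switch would be metric = "Rsquared" instead of the default "RMSE".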
■ trainControl
> args(trainControl)
function (method = "boot", number = ifelse(grepl("cv", method),
10, 25), repeats = ifelse(grepl("cv", method), 1, number),
p = 0.75, search = "grid", initialWindow = NULL, horizon = 1,
fixedWindow = TRUE, verboseIter = FALSE, returnData = TRUE,
returnResamp = "final", savePredictions = FALSE, classProbs = FALSE,
summaryFunction = defaultSummary, selectionFunction = "best",
preProcOptions = list(thresh = 0.95, ICAcomp = 3, k = 5),
sampling = NULL, index = NULL, indexOut = NULL, timingSamps = 0,
predictionBounds = rep(FALSE, 2), seeds = NA, adaptive = list(min = 5,
alpha = 0.05, method = "gls", complete = TRUE), trim = FALSE,
allowParallel = TRUE)
NULL
trainControl resampling
method
- boot = bootstrapping
- boot632 = bootstrapping with adjustment
- cv = cross validation
- repeatedcv = repeated cross validation
- LOOCV = leave one out cross validation
number
- For boot/cross validation
- Number of subsamples to take
repeats
- Number of times to repeat sub-sampling
- If big this can slow things down
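For example (a minimal sketch, assuming caret is installed), 10-fold cross validation repeated 3 times, i.e. 30 model fits in total, would be requested like this:

```r
library(caret)

# 10-fold cross validation, repeated 3 times
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

ctrl$method    # "repeatedcv"
ctrl$number    # 10
ctrl$repeats   # 3
```

The resulting object is then passed to train() via trControl = ctrl.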
Setting the seed
It is often useful to set an overall seed.
You can also set a seed for each resample; seeding each resample is useful for parallel fits.
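The overall seed works through base R's set.seed(): with the same seed, the same random resampling indices are drawn, so repeated fits are reproducible. A base-R illustration:

```r
set.seed(1235)
first  <- sample(1:100, 10)   # indices a resampling step might draw

set.seed(1235)                # same seed again
second <- sample(1:100, 10)

identical(first, second)      # TRUE: identical resamples
```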
Plotting predictions
Before applying a machine learning algorithm, you should first understand the characteristics of the data.
A good way to do this is to draw graphs and inspect them.
Here we use a dataset on wages.
Predicting wages
library(ISLR); library(ggplot2); library(caret);
data(Wage)
summary(Wage)
The summary output looks like this:
year age sex maritl
Min. :2003 Min. :18.00 1. Male :3000 1. Never Married: 648
1st Qu.:2004 1st Qu.:33.75 2. Female: 0 2. Married :2074
Median :2006 Median :42.00 3. Widowed : 19
Mean :2006 Mean :42.41 4. Divorced : 204
3rd Qu.:2008 3rd Qu.:51.00 5. Separated : 55
Max. :2009 Max. :80.00
race education region
1. White:2480 1. < HS Grad :268 2. Middle Atlantic :3000
2. Black: 293 2. HS Grad :971 1. New England : 0
3. Asian: 190 3. Some College :650 3. East North Central: 0
4. Other: 37 4. College Grad :685 4. West North Central: 0
5. Advanced Degree:426 5. South Atlantic : 0
6. East South Central: 0
(Other) : 0
jobclass health health_ins logwage
1. Industrial :1544 1. <=Good : 858 1. Yes:2083 Min. :3.000
2. Information:1456 2. >=Very Good:2142 2. No : 917 1st Qu.:4.447
Median :4.653
Mean :4.654
3rd Qu.:4.857
Max. :5.763
wage
Min. : 20.09
1st Qu.: 85.38
Median :104.92
Mean :111.70
3rd Qu.:128.68
Max. :318.34
Let's draw a graph of this data.
featurePlot() comes from the caret package:
a wrapper for lattice plotting of predictor variables.
# drawing feature plot
inTrain <- createDataPartition(y=Wage$wage, p=0.7, list=FALSE)
training <- Wage[inTrain,]
testing <- Wage[-inTrain,]
featurePlot(x = training[,c("age","education","jobclass")],
            y = training$wage,
            plot="pairs")
This is an interesting plot: it shows at a glance how wage, the y value, relates to age, education, and jobclass.
It looks busy at first, but becomes clear on closer inspection.
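To dig into a single relationship more closely, a regular ggplot2 scatter plot works too. A sketch, assuming the same Wage training split as above:

```r
library(ISLR); library(ggplot2); library(caret)

data(Wage)
inTrain  <- createDataPartition(y = Wage$wage, p = 0.7, list = FALSE)
training <- Wage[inTrain, ]

# wage vs. age, coloured by jobclass, with a linear trend per class
qq <- ggplot(training, aes(age, wage, colour = jobclass)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)
qq
```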