Week 02: Training options, Plotting predictions
- Caret package
- Data slicing
- Training options
- Plotting predictions
- Basic preprocessing
- Covariate creation
- Preprocessing with principal components analysis
- Predicting with Regression
- Predicting with Regression Multiple Covariates
Training options
Previously, we set the training options like this:
inTrain <- createDataPartition(y=spam$type,
                               p=0.75, list=FALSE)
training <- spam[inTrain,]
testing <- spam[-inTrain,]
modelFit <- train(type ~ ., data=training, method="glm")
However, many more training options are available:
> args(train.default)
function (x, y, method = "rf", preProcess = NULL, ..., weights = NULL,
metric = ifelse(is.factor(y), "Accuracy", "RMSE"), maximize = ifelse(metric %in%
c("RMSE", "logLoss"), FALSE, TRUE), trControl = trainControl(),
tuneGrid = NULL, tuneLength = 3)
NULL
The preProcess argument specifies how the predictors should be preprocessed.
If the data sample is unbalanced, you can adjust the weights argument.
The line metric = ifelse(is.factor(y), "Accuracy", "RMSE") means that for categorical data, training selects the model that maximizes accuracy,
while for continuous data it selects the model that minimizes RMSE (note that maximize is FALSE when the metric is "RMSE" or "logLoss").
trainControl can also be configured.
■ Metric options
Continuous outcomes:
- RMSE = Root mean squared error
- RSquared = $R^2$ from regression models
Categorical outcomes:
- Accuracy = Fraction correct
- Kappa = A measure of concordance
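As a minimal sketch (assuming the caret and kernlab packages are installed), you could ask train() to select the final model by Kappa instead of the default Accuracy for a categorical outcome:

```r
library(caret)
library(kernlab)   # provides the spam data set

data(spam)
inTrain  <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]

# Select the final model by Kappa (concordance) rather than Accuracy
modelFit <- train(type ~ ., data = training,
                  method = "glm", metric = "Kappa")
modelFit$metric   # "Kappa"
```

For continuous outcomes, the analogous switch would be metric = "Rsquared" instead of the default "RMSE".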
■ trainControl
> args(trainControl)
function (method = "boot", number = ifelse(grepl("cv", method),
10, 25), repeats = ifelse(grepl("cv", method), 1, number),
p = 0.75, search = "grid", initialWindow = NULL, horizon = 1,
fixedWindow = TRUE, verboseIter = FALSE, returnData = TRUE,
returnResamp = "final", savePredictions = FALSE, classProbs = FALSE,
summaryFunction = defaultSummary, selectionFunction = "best",
preProcOptions = list(thresh = 0.95, ICAcomp = 3, k = 5),
sampling = NULL, index = NULL, indexOut = NULL, timingSamps = 0,
predictionBounds = rep(FALSE, 2), seeds = NA, adaptive = list(min = 5,
alpha = 0.05, method = "gls", complete = TRUE), trim = FALSE,
allowParallel = TRUE)
NULL
trainControl resampling
method
- boot = bootstrapping
- boot632 = bootstrapping with adjustment
- cv = cross validation
- repeatedcv = repeated cross validation
- LOOCV = leave one out cross validation
number
- For boot/cross validation
- Number of subsamples to take
repeats
- Number of times to repeat sub-sampling
- If big this can slow things down
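For example (a minimal sketch, assuming caret is installed), 10-fold cross validation repeated 3 times, i.e. 30 model fits in total, would be requested like this:

```r
library(caret)

# 10-fold cross validation, repeated 3 times
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

ctrl$method    # "repeatedcv"
ctrl$number    # 10
ctrl$repeats   # 3
```

The resulting object is then passed to train() via trControl = ctrl.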
Setting the seed
It is often useful to set an overall seed.
You can also set a seed for each resample; seeding each resample is useful for parallel fits.
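The overall seed works through base R's set.seed(): with the same seed, the same random resampling indices are drawn, so repeated fits are reproducible. A base-R illustration:

```r
set.seed(1235)
first  <- sample(1:100, 10)   # indices a resampling step might draw

set.seed(1235)                # same seed again
second <- sample(1:100, 10)

identical(first, second)      # TRUE: identical resamples
```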
Plotting predictions
Before applying a machine learning algorithm, you should first understand the characteristics of the data.
A good way to do this is to draw graphs and inspect them.
Here we use a dataset on wages.
Predicting wages
library(ISLR); library(ggplot2); library(caret);
data(Wage)
summary(Wage)
The summary output looks like this:
year age sex maritl
Min. :2003 Min. :18.00 1. Male :3000 1. Never Married: 648
1st Qu.:2004 1st Qu.:33.75 2. Female: 0 2. Married :2074
Median :2006 Median :42.00 3. Widowed : 19
Mean :2006 Mean :42.41 4. Divorced : 204
3rd Qu.:2008 3rd Qu.:51.00 5. Separated : 55
Max. :2009 Max. :80.00
race education region
1. White:2480 1. < HS Grad :268 2. Middle Atlantic :3000
2. Black: 293 2. HS Grad :971 1. New England : 0
3. Asian: 190 3. Some College :650 3. East North Central: 0
4. Other: 37 4. College Grad :685 4. West North Central: 0
5. Advanced Degree:426 5. South Atlantic : 0
6. East South Central: 0
(Other) : 0
jobclass health health_ins logwage
1. Industrial :1544 1. <=Good : 858 1. Yes:2083 Min. :3.000
2. Information:1456 2. >=Very Good:2142 2. No : 917 1st Qu.:4.447
Median :4.653
Mean :4.654
3rd Qu.:4.857
Max. :5.763
wage
Min. : 20.09
1st Qu.: 85.38
Median :104.92
Mean :111.70
3rd Qu.:128.68
Max. :318.34
Let's draw a graph of this data.
featurePlot() comes from the caret package:
a wrapper for lattice plotting of predictor variables.
# drawing feature plot
inTrain <- createDataPartition(y=Wage$wage, p=0.7, list=FALSE)
training <- Wage[inTrain,]
testing <- Wage[-inTrain,]
featurePlot(x = training[,c("age","education","jobclass")],
            y = training$wage,
            plot="pairs")
This is an interesting plot: it shows at a glance how wage, the y value, relates to age, education, and jobclass.
It looks busy at first, but becomes clear on closer inspection.
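To dig into a single relationship more closely, a regular ggplot2 scatter plot works too. A sketch, assuming the same Wage training split as above:

```r
library(ISLR); library(ggplot2); library(caret)

data(Wage)
inTrain  <- createDataPartition(y = Wage$wage, p = 0.7, list = FALSE)
training <- Wage[inTrain, ]

# wage vs. age, coloured by jobclass, with a linear trend per class
qq <- ggplot(training, aes(age, wage, colour = jobclass)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)
qq
```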