Basic Idea

  1. Fit a regression model
  2. Penalize
    Pros:
  • Can help with the bias / variance trade-off
    Cons:
  • May be computationally demanding on large data sets
  • Does not perform as well as random forests and boosting

The bias is error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting). The variance is error from sensitivity to small fluctuations in the training set. High variance can cause overfitting: modeling the random noise in the training data, rather than the intended outputs.

A motivating example


where $X1$ and $X2$ are nearly perfectly correlated (co-linear). We can approximate this model by:
$$ Y = \beta_0 + \beta_1X_1+\beta_2X_2+\varepsilon $$

$$ Y = \beta_0 + (\beta_1+\beta_2X_1)+\varepsilon $$
These two functions are not exactly same beacuse we choose to leave one of the predictors out but we can avoid overffting.

Y=β0+β1X1+β2X2+ε
Prostate cancer
library(ElemStatLearn); data(prostate)
str(prostate)
## 'data.frame':    97 obs. of  10 variables:
##  $ lcavol : num  -0.58 -0.994 -0.511 -1.204 0.751 ...
##  $ lweight: num  2.77 3.32 2.69 3.28 3.43 ...
##  $ age    : int  50 58 74 58 62 50 64 58 47 63 ...
##  $ lbph   : num  -1.39 -1.39 -1.39 -1.39 -1.39 ...
##  $ svi    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ lcp    : num  -1.39 -1.39 -1.39 -1.39 -1.39 ...
##  $ gleason: int  6 6 7 6 6 6 6 6 6 6 ...
##  $ pgg45  : int  0 0 20 0 0 0 0 0 0 0 ...
##  $ lpsa   : num  -0.431 -0.163 -0.163 -0.163 0.372 ...
##  $ train  : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...


Model selection approach: split samples

  • No method better when data/computation time permits it
  • Approach
    1. Divide data into training/test/validation
    2. Treat validation as test data, train all competing models on the train data and pick the best one on validation.
    3. To appropriately assess performance on new data apply to test set
    4. You may re-split and reperform steps 1-3
    5. Two common problems
  • Limited data
    • Computational complexity


'MOOC > Practical Machine Learning (r programing)' 카테고리의 다른 글

Term Project  (0) 2016.04.07
Certification and Comments  (0) 2015.12.04
Week 04: Regularized Regression  (0) 2015.11.24
Week03: Model based prediction  (0) 2015.11.23
Week03: Boosting  (0) 2015.11.23
Week 03: Random Forests  (0) 2015.11.23

+ Recent posts