Week 04: Regularized Regression

JAYNUX 2015. 11. 24. 16:07

Basic Idea

1. Fit a regression model
2. Penalize (or shrink) large coefficients

Pros:
- Can help with the bias/variance trade-off

Cons:
- May be computationally demanding on large data sets
- Often does not perform as well as random forests and boosting

The bias is error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting). The variance is error from sensitivity to small fluctuations in the training set. High variance can cause overfitting: modeling the random noise in the training data, rather than the intended outputs.
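One common way to make the "penalize" step concrete is the ridge (L2) penalty. The notes here do not say which penalty is meant, so the formula below is an illustrative assumption; $\lambda \ge 0$ (a tuning parameter), $n$ (number of observations) and $p$ (number of predictors) are symbols introduced for this sketch:

$$ \hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 $$

With $\lambda = 0$ this reduces to ordinary least squares. Larger values of $\lambda$ shrink the coefficients toward zero, which adds some bias but can reduce variance.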

A motivating example

Consider the model

$$ Y = \beta_0 + \beta_1X_1+\beta_2X_2+\varepsilon $$

where $X_1$ and $X_2$ are nearly perfectly correlated (collinear). We can approximate this model by:

$$ Y = \beta_0 + (\beta_1+\beta_2)X_1+\varepsilon $$

The two models are not exactly the same because we leave one of the predictors out, so the estimate is slightly biased, but the simpler model has lower variance and helps us avoid overfitting.
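The effect of dropping one of two collinear predictors can be checked with a small simulation. This is only a sketch on made-up data: the coefficient values (1, 2, 3), the sample size, and all variable names are assumptions chosen for illustration, not part of the original notes.

```r
# Simulated illustration; every name and value here is an assumption
set.seed(1)

n_sims <- 500   # number of simulated data sets
n      <- 50    # observations per data set

full_b1    <- numeric(n_sims)  # estimate of beta_1 in the full model
reduced_b1 <- numeric(n_sims)  # estimate of (beta_1 + beta_2) in the reduced model

for (i in seq_len(n_sims)) {
  x1 <- rnorm(n)
  x2 <- x1 + rnorm(n, sd = 0.01)        # x2 is nearly identical to x1
  y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)  # true beta_1 = 2, beta_2 = 3

  full_b1[i]    <- coef(lm(y ~ x1 + x2))["x1"]
  reduced_b1[i] <- coef(lm(y ~ x1))["x1"]
}

# Spread of the coefficient estimates across simulations
sd_full    <- sd(full_b1)
sd_reduced <- sd(reduced_b1)
c(full = sd_full, reduced = sd_reduced)
```

In the full model the estimate of $\beta_1$ varies wildly from one simulated data set to the next, while the reduced model gives a stable estimate close to $\beta_1+\beta_2$.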


Prostate cancer

# Load the prostate cancer data set from the ElemStatLearn package
library(ElemStatLearn)
data(prostate)
str(prostate)

## 'data.frame':    97 obs. of  10 variables:
##  $ lcavol : num  -0.58 -0.994 -0.511 -1.204 0.751 ...
##  $ lweight: num  2.77 3.32 2.69 3.28 3.43 ...
##  $ age    : int  50 58 74 58 62 50 64 58 47 63 ...
##  $ lbph   : num  -1.39 -1.39 -1.39 -1.39 -1.39 ...
##  $ svi    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ lcp    : num  -1.39 -1.39 -1.39 -1.39 -1.39 ...
##  $ gleason: int  6 6 7 6 6 6 6 6 6 6 ...
##  $ pgg45  : int  0 0 20 0 0 0 0 0 0 0 ...
##  $ lpsa   : num  -0.431 -0.163 -0.163 -0.163 0.372 ...
##  $ train  : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...

Model selection approach: split samples

No method is better when data and computation time permit it.

Approach
1. Divide the data into training/test/validation sets
2. Treat validation as test data: train all competing models on the training data and pick the best one on validation
3. To appropriately assess performance on new data, apply the chosen model to the test set
4. You may re-split and repeat steps 1-3

Two common problems
- Limited data
- Computational complexity
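The split-sample steps above can be sketched in base R as follows. The data are simulated, and the 60/20/20 split, the three candidate formulas, and the helper `rmse()` are all assumptions made for this illustration rather than anything prescribed by the notes.

```r
set.seed(42)

# Simulated data: y depends on x1 and x2 only, x3 is pure noise
n   <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 1.5 * dat$x2 + rnorm(n)

# Step 1: divide into training / validation / test (60% / 20% / 20%)
idx <- sample(rep(c("train", "validation", "test"), times = c(180, 60, 60)))
train      <- dat[idx == "train", ]
validation <- dat[idx == "validation", ]
test       <- dat[idx == "test", ]

# Hypothetical helper: root mean squared error of a fitted model on a data set
rmse <- function(fit, newdata) {
  sqrt(mean((newdata$y - predict(fit, newdata = newdata))^2))
}

# Step 2: train competing models on the training set, compare on validation
formulas <- list(
  small  = y ~ x1,
  medium = y ~ x1 + x2,
  large  = y ~ x1 + x2 + x3
)
fits      <- lapply(formulas, function(f) lm(f, data = train))
val_rmse  <- sapply(fits, rmse, newdata = validation)
best_name <- names(which.min(val_rmse))

# Step 3: assess the chosen model once on the held-out test set
test_rmse <- rmse(fits[[best_name]], test)
```

The test set is touched only once, after the model has been chosen, so `test_rmse` is not biased by the selection step. Changing the seed and re-running corresponds to step 4.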


jemin lee

November 24, 2015
