Regularized Regression
jemin lee
November 24, 2015
Basic Idea
- Fit a regression model
- Penalize (or shrink) large coefficients
Pros:
- Can help with the bias / variance trade-off
Cons:
- May be computationally demanding on large data sets
- Does not perform as well as random forests and boosting
The bias is error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting). The variance is error from sensitivity to small fluctuations in the training set. High variance can cause overfitting: modeling the random noise in the training data, rather than the intended outputs.
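To make the basic idea concrete (fit a regression model, then penalize large coefficients), here is a minimal sketch using the glmnet package; the package choice, the simulated data, and the lambda values are my own assumptions and are not part of the lecture.

library(glmnet)   # assumed package choice; any penalized-regression tool would do

set.seed(1)
n <- 100
x <- matrix(rnorm(n * 5), n, 5)          # five simulated predictors
y <- 2 * x[, 1] - x[, 2] + rnorm(n)      # outcome depends only on the first two

coef(lm(y ~ x))                          # ordinary least squares, no penalty
fit.ridge <- glmnet(x, y, alpha = 0)     # alpha = 0 selects the ridge penalty

# The penalty lambda shrinks coefficients toward zero:
coef(fit.ridge, s = 0.01)                # small penalty: close to the OLS fit
coef(fit.ridge, s = 1)                   # larger penalty: more bias, less variance

Larger penalties trade a little bias for a reduction in variance, which is the trade-off listed under "Pros" above.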
A motivating example
$$ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon $$
where $X_1$ and $X_2$ are nearly perfectly correlated (co-linear). We can approximate this model by:
$$ Y = \beta_0 + (\beta_1 + \beta_2) X_1 + \varepsilon $$
These two models are not exactly the same because we leave one of the predictors out, but doing so can help us avoid overfitting.
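A small simulation of this situation (the data and coefficient values below are made up for illustration): when $X_2$ is nearly a copy of $X_1$, the individual estimates of $\beta_1$ and $\beta_2$ are unstable, while the single-predictor model recovers roughly $\beta_1 + \beta_2$ as one coefficient.

set.seed(2)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)        # x2 is nearly identical to x1 (co-linear)
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)  # true beta0 = 1, beta1 = 2, beta2 = 3

coef(lm(y ~ x1 + x2))  # unstable individual estimates of beta1 and beta2
coef(lm(y ~ x1))       # single slope close to beta1 + beta2 = 5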
Prostate cancer
library(ElemStatLearn); data(prostate)
str(prostate)
## 'data.frame': 97 obs. of 10 variables:
## $ lcavol : num -0.58 -0.994 -0.511 -1.204 0.751 ...
## $ lweight: num 2.77 3.32 2.69 3.28 3.43 ...
## $ age : int 50 58 74 58 62 50 64 58 47 63 ...
## $ lbph : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
## $ svi : int 0 0 0 0 0 0 0 0 0 0 ...
## $ lcp : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
## $ gleason: int 6 6 7 6 6 6 6 6 6 6 ...
## $ pgg45 : int 0 0 20 0 0 0 0 0 0 0 ...
## $ lpsa : num -0.431 -0.163 -0.163 -0.163 0.372 ...
## $ train : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
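The lecture does not show a penalized fit at this point, but as a rough sketch one could apply the lasso to these data with glmnet, treating lpsa as the outcome and the eight clinical variables as predictors; the chunk below is my own assumption, not course code.

library(glmnet)

x <- as.matrix(prostate[, 1:8])       # predictors: lcavol through pgg45
y <- prostate$lpsa                    # outcome: log PSA

cvfit <- cv.glmnet(x, y, alpha = 1)   # lasso penalty, lambda chosen by cross-validation
coef(cvfit, s = "lambda.min")         # some coefficients may be shrunk exactly to zero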
Model selection approach: split samples
- No method is better when data and computation time permit it
- Approach
- Divide data into training/test/validation
- Treat the validation set as test data: train all competing models on the training set and pick the best one on the validation set.
- To appropriately assess performance on new data apply to test set
- You may re-split and repeat steps 1-3 (a sketch of such a split follows below)
- Two common problems
- Limited data
- Computational complexity
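Since the notes do not show code for this approach, here is a minimal base-R sketch of the split itself, using the prostate data loaded above; the 60/20/20 proportions are an assumption.

set.seed(3)
n <- nrow(prostate)
shuffled <- sample(n)                          # random permutation of row indices
trainIdx <- shuffled[1:round(0.6 * n)]
validIdx <- shuffled[(round(0.6 * n) + 1):round(0.8 * n)]
testIdx  <- shuffled[(round(0.8 * n) + 1):n]

training   <- prostate[trainIdx, ]
validation <- prostate[validIdx, ]
testing    <- prostate[testIdx, ]

# Fit competing models on training, pick the best on validation,
# and report the final performance once on the test set.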