Week02: Covariate creation, Pre-processing with principal components analysis
- Caret package
- Data slicing
- Training options
- Plotting predictions
- Basic preprocessing
- Covariate creation
- Preprocessing with principal components analysis
- Predicting with Regression
- Predicting with Regression Multiple Covariates
Covariate creation
Covariates are sometimes called predictors or features.
There are two levels of covariate creation, or feature creation.
Level 1: From raw data to covariate
Raw data takes the form of an image, a text file, or a website.
That kind of information is very hard to build a predictive model around until you have summarized it in some useful way into either quantitative or qualitative variables.
Suppose the raw data is an e-mail message like the lecture's example. We cannot feed the raw e-mail into a model directly, so we extract a few features from it:
capitalAve: the example message is written entirely in capital letters, so the fraction of capitals is 100% (= 1).
Frequency of particular words, for example "you", which appears 2 times.
Number of dollar signs, which is 8.
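As a rough sketch of what this Level 1 extraction could look like in R (the e-mail string here is a made-up placeholder, so its counts will not match the lecture's 1 / 2 / 8):
# hypothetical raw e-mail text (placeholder, not the message from the lecture slide)
emailText <- "YOU HAVE WON! SEND $100 TODAY, YOU LUCKY WINNER $$$$$$$"
countMatches <- function(pattern, text) sum(gregexpr(pattern, text)[[1]] > 0)
chars       <- strsplit(emailText, "")[[1]]
letterChars <- chars[grepl("[A-Za-z]", chars)]
capitalAve  <- mean(letterChars %in% LETTERS)                 # fraction of letters that are capitals
youCount    <- countMatches("\\byou\\b", tolower(emailText))  # how often "you" appears
dollarCount <- countMatches("\\$", emailText)                 # number of dollar signs
c(capitalAve = capitalAve, you = youCount, dollars = dollarCount)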
Level 2: Transforming tidy covariates
For example, square an existing covariate to create a new one.
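A minimal sketch of such a transformation, using capitalAve from the kernlab spam data (the column name capitalAveSq is just illustrative):
library(kernlab); data(spam)
spam$capitalAveSq <- spam$capitalAve^2   # new covariate: the square of an existing one
head(spam[, c("capitalAve", "capitalAveSq")])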
Level 1, Raw data -> covariates
Depends heavily on application
The balancing act is summarization vs. information loss
Examples:
Text files: frequency of words, frequency of phrases (Google ngrams), frequency of capital letters.
Images: Edges, corners, blobs, ridges (computer vision feature detection)
Webpages: Number and type of images, position of elements, colors, videos (A/B Testing)
People: Height, weight, hair color, sex, country of origin.
The more knowledge of the system you have the better the job you will do.
When in doubt, err on the side of more features
Can be automated, but use caution!
Level 2, Tidy covariates -> new covariates
More necessary for some methods (regression, svms) than for others (classification trees).
Should be done only on the training set
The best approach is through exploratory analysis (plotting/tables)
New covariates should be added to data frames
Dummy variables: convert factor (qualitative) variables into indicator variables.
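A sketch of creating dummy variables with caret's dummyVars, using the same Wage data split that the spline example below uses:
library(ISLR); library(caret); data(Wage)
inTrain <- createDataPartition(y=Wage$wage, p=0.7, list=FALSE)
training <- Wage[inTrain,]
# turn the factor jobclass into 0/1 indicator columns
dummies <- dummyVars(wage ~ jobclass, data=training)
head(predict(dummies, newdata=training))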
Removing zero covariates: identify covariates with (near) zero variability, since they carry little information for prediction.
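A sketch using caret's nearZeroVar for this (it assumes the Wage training split from the dummy-variable sketch above):
# saveMetrics=TRUE returns the frequency ratio, percent of unique values,
# and the zeroVar / nzv flags for every covariate in the training set
nsv <- nearZeroVar(training, saveMetrics=TRUE)
nsv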
Spline basis
# spline basis
# data
library(ISLR); library(caret); data(Wage)
inTrain <- createDataPartition(y=Wage$wage, p=0.7, list=FALSE)
training <- Wage[inTrain,]; testing <- Wage[-inTrain,]

# cubic spline basis for age with 3 degrees of freedom (three basis columns)
library(splines)
bsBasis <- bs(training$age, df=3)
bsBasis

# fit wage on the spline basis and overlay the fitted curve on the scatterplot
lm1 <- lm(wage ~ bsBasis, data=training)
plot(training$age, training$wage, pch=19, cex=0.5)
points(training$age, predict(lm1, newdata=training), col="red", pch=19, cex=0.5)
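One point worth noting, consistent with "should be done only on the training set" above: on the test set the spline covariates are built from the training-set basis rather than refit. A minimal sketch using predict() on the bs object:
# build the same spline covariates for the test set from the training basis
head(predict(bsBasis, testing$age))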
Notes and further reading
Preprocessing with principal components analysis
If you have many quantitative variables, some of them are often highly correlated with each other.
# Correlated predictors
library(caret); library(kernlab); data(spam)
inTrain <- createDataPartition(y=spam$type, p=0.75, list=FALSE)
training <- spam[inTrain,]
testing <- spam[-inTrain,]

# absolute correlations between all 57 predictors (column 58 is the outcome, type)
M <- abs(cor(training[,-58]))
diag(M) <- 0                  # ignore each variable's correlation with itself
which(M > 0.8, arr.ind=T)     # predictor pairs with correlation above 0.8
        row col
num415   34  32
direct   40  32
num857   32  34
direct   40  34
num857   32  40
num415   34  40
# plot the two highly correlated predictors against each other
names(spam)[c(34,32)]
plot(spam[,34],spam[,32])
Since the correlation between these two variables is high, we plot them to look at the relationship more closely.
The points fall almost exactly on a straight line.
This tells us that including both of these features in the model is probably not a good idea.
Basic PCA idea
How can we find a single combined variable that is more useful than the two originals?
We might not need every predictor
A weighted combination of predictors might be better
We should pick this combination to capture the "most information" possible
Benefits
- Reduced number of predictors
- Reduced noise (due to averaging)
# we could rotate the plot: add and subtract the two equally weighted predictors
X <- 0.71*training$num415 + 0.71*training$num857   # "sum" direction
Y <- 0.71*training$num415 - 0.71*training$num857   # "difference" direction
plot(X,Y)
This produces the plot above: the X values are spread out widely, while most of the Y values are clustered near 0, so almost all of the information in the pair is captured by the X (sum) direction.
To apply the same idea to more than two variables, it is better to use a PCA function, as in the prcomp example below.
# PCA
smallSpam <- spam[,c(34,32)]         # just the two correlated predictors
prComp <- prcomp(smallSpam)          # principal components analysis
plot(prComp$x[,1], prComp$x[,2])     # PC1 vs PC2 reproduces the rotated plot
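To go beyond two variables, here is a sketch using caret's preProcess with method = "pca" on all 57 quantitative spam predictors; the log10(x + 1) transform is one common way to reduce the skewness of the count variables first (this block is an assumption-laden sketch, not part of the original example):
# PCA on all quantitative spam predictors via caret, keeping 2 components
preProc <- preProcess(log10(spam[,-58] + 1), method="pca", pcaComp=2)
spamPC  <- predict(preProc, log10(spam[,-58] + 1))
plot(spamPC[,1], spamPC[,2], col=((spam$type=="spam")*1 + 1))  # black = nonspam, red = spam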