Principal Component Analysis (PCA)

JAYNUX 2015. 11. 18. 14:53


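Principal component analysis (PCA) is a dimensionality reduction technique: when a dataset has many variables, it reduces the number of variables.

PCA re-expresses the variables as a new set of linearly uncorrelated variables called principal components.

To see what this means, suppose two variables are both linear functions of a third variable c:

A = 2c + 3

B = 3c

Now suppose there is some model Y = f(A, B).

A and B are linearly related to each other through c, so is there really any need to build a model that uses both of them to predict Y?

In a case like this, predicting Y with f(c) is better than predicting it with f(A, B).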
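The A and B example can be checked numerically. The following is a minimal sketch in R; the vector `c_vals` and its values are made up for illustration (the name avoids shadowing R's `c()` function).

```r
# Hypothetical values for the hidden variable c (not from the original post)
c_vals <- 1:10

# The two variables from the example, both linear functions of c
A <- 2 * c_vals + 3
B <- 3 * c_vals

# A and B are perfectly linearly correlated
ab_cor <- cor(A, B)

# PCA on (A, B): share of the total variance carried by each component
pca_ab <- prcomp(cbind(A, B))
var_share <- pca_ab$sdev^2 / sum(pca_ab$sdev^2)

print(ab_cor)
print(var_share)
```

The first component carries essentially all of the variance, which is the numerical counterpart of saying that a single variable is enough here.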

The principal components are chosen so that they preserve as much of the variance (the spread) of the original data as possible.
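One way to make "preserving variance" concrete: the variances of the principal components are the eigenvalues of the covariance matrix of the data, and together they add up to the total variance of the original variables. The sketch below checks this with prcomp() on the built-in iris measurements; the choice of dataset is mine, not the original post's.

```r
# Four numeric measurements from the built-in iris data (illustrative choice)
X <- as.matrix(iris[, 1:4])

# PCA: prcomp() centers the data and does not scale it by default
pca <- prcomp(X)
pc_var <- pca$sdev^2

# Eigenvalues of the sample covariance matrix
eig_val <- eigen(cov(X))$values

print(rbind(pc_var, eig_val))

# Total variance before and after the change of basis
total_var_original <- sum(diag(cov(X)))
total_var_pcs <- sum(pc_var)
print(c(total_var_original, total_var_pcs))
```

The components are ordered by variance, so keeping only the first few discards as little spread as possible.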

For more detail, see the earlier posts below.

Practical Machine Learning (Johns Hopkins)

Machine Learning (Stanford)

Applying it in practice with R [R프로그래밍 실습서]

Use the princomp() function or prcomp().

princomp: performs principal component analysis.

x, # a matrix or data frame

cor=FALSE # if cor=FALSE, the analysis uses the covariance matrix; if TRUE, it uses the correlation matrix.
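To see what the cor argument changes, here is a small sketch on the built-in USArrests data, whose variables sit on very different scales. The dataset is my choice for illustration and does not appear in the original post.

```r
# PCA on the covariance matrix (default) and on the correlation matrix
pc_cov <- princomp(USArrests, cor = FALSE)
pc_cor <- princomp(USArrests, cor = TRUE)

# Share of variance explained by each component in both cases
share_cov <- pc_cov$sdev^2 / sum(pc_cov$sdev^2)
share_cor <- pc_cor$sdev^2 / sum(pc_cor$sdev^2)
print(round(share_cov, 4))
print(round(share_cor, 4))

# With cor = TRUE the component variances are the eigenvalues
# of the correlation matrix
cor_eig <- eigen(cor(USArrests))$values
print(cor_eig)
```

With cor=FALSE the variable with the largest raw variance dominates the first component; cor=TRUE puts all variables on the same footing, which is usually what you want when units differ.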

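Below, x holds 1:10.

y is x with a small amount of noise added, i.e. y = x + noise.

Finally, z is x + y with noise added. As written, that noise is drawn from runif(10, min=-10, max=.10), so it is almost always negative.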

x <- 1:10

y <- x + runif(10, min=-.5, max=.5)

z <- x + y + runif(10, min=-10, max=.10)

Printing the contents as a data frame gives the following.

> (data <- data.frame(x,y,z))

    x         y          z
1   1  1.413629 -0.5826325
2   2  2.021631 -0.9343342
3   3  3.045810 -3.1059315
4   4  3.857941  6.3310046
5   5  5.418124  0.4202864
6   6  6.306387  6.6968121
7   7  7.488154 13.1231456
8   8  8.148416 12.0075606
9   9  9.260750 10.7979732
10 10 10.086332 16.6841931

# do PCA using data

pr <- princomp(data)

summary(pr)

Importance of components:
                          Comp.1    Comp.2       Comp.3
Standard deviation     7.5667362 1.5335677 0.1356764036
Proportion of Variance 0.9602481 0.0394432 0.0003087272
Cumulative Proportion  0.9602481 0.9996913 1.0000000000

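This output shows how much of the variance of the original data each principal component explains.

The Proportion of Variance row shows that the first principal component explains 96.02% of the variance in the data,

and the second principal component explains 3.94% of it.

The third principal component explains the smallest share, 0.03%.

The last row, Cumulative Proportion, is the running total of Proportion of Variance.

So the first and second principal components together account for 99.97% of the variance of the original data.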
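The Proportion of Variance and Cumulative Proportion rows can be reproduced by hand from the component standard deviations. A minimal sketch follows; the seed is my addition (the original post does not set one), so the numbers will differ from the output shown above.

```r
set.seed(42)  # assumption: arbitrary seed, only for repeatability

# Same construction as in the post
x <- 1:10
y <- x + runif(10, min = -.5, max = .5)
z <- x + y + runif(10, min = -10, max = .10)

pr <- princomp(data.frame(x, y, z))

# Variance of each component = squared standard deviation
prop_var <- pr$sdev^2 / sum(pr$sdev^2)

# Running total, as in the "Cumulative Proportion" row
cum_prop <- cumsum(prop_var)

print(round(rbind(prop_var, cum_prop), 4))
```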

The coordinates on these two principal components can be read from scores.

In effect, the distribution of x, y and z has been reduced to just two dimensions, with the following coordinates.

> pr$scores[,1:2]

           Comp.1      Comp.2
 [1,]   8.9555327  1.91623469
 [2,]   8.6788410  0.76263371
 [3,]   9.8173286 -1.57583908
 [4,]   1.0457826  2.13444485
 [5,]   5.2072178 -2.44130849
 [6,]  -0.8729855 -0.38916205
 [7,]  -7.1881748  1.55801181
 [8,]  -6.8265993 -0.01728680
 [9,]  -6.5475880 -1.91981691
[10,] -12.2693552 -0.02791173
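Where the scores come from: each row of the data is centered and then projected onto the loadings. The sketch below recomputes the scores that way and also rebuilds an approximation of the data from the first two components only. The seed is my addition, so the values will not match the output above.

```r
set.seed(42)  # assumption: arbitrary seed, only for repeatability

x <- 1:10
y <- x + runif(10, min = -.5, max = .5)
z <- x + y + runif(10, min = -10, max = .10)
data <- data.frame(x, y, z)

pr <- princomp(data)

# Scores = centered data multiplied by the loadings matrix
centered <- scale(data, center = pr$center, scale = FALSE)
loadings_mat <- unclass(pr$loadings)
manual_scores <- centered %*% loadings_mat
max_diff <- max(abs(manual_scores - pr$scores))

# Approximate reconstruction using only the first two components
approx_data <- pr$scores[, 1:2] %*% t(loadings_mat[, 1:2])
approx_data <- sweep(approx_data, 2, pr$center, "+")
recon_err <- max(abs(approx_data - as.matrix(data)))

print(max_diff)
print(recon_err)
```

The reconstruction error is bounded by the size of the dropped third component, which is why throwing it away loses so little.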

An example using the Alzheimer's data [Practical Machine Learning HW-2]

Load the data.

set.seed(3433)

library(AppliedPredictiveModeling)

library(caret)  # provides createDataPartition(), preProcess(), train(), confusionMatrix()

data(AlzheimerDisease)

adData = data.frame(diagnosis, predictors)

inTrain = createDataPartition(adData$diagnosis, p = 3/4)[[1]]

training = adData[inTrain, ]

testing = adData[-inTrain, ]

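From this dataset, PCA is run only on the training-set variables whose names start with "IL".

Using preProcess(), let's determine how many principal components are needed to capture 80% of the variance.

First, collect the matching column names with grep and a regular expression.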

IL_str <- grep("^IL", colnames(training), value = TRUE)

Run PCA with preProcess() from the caret package as shown below. The threshold is 80%.

preProc <- preProcess(training[, IL_str], method = "pca", thresh = 0.8)

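The result shows that 7 principal components, instead of all 12 IL variables, are enough to explain 80% of the variance of the original data. Each component is a linear combination of all 12 variables, with the weights given by the rotation matrix.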

preProc$rotation

                      PC1           PC2           PC3          PC4         PC5          PC6         PC7
IL_11         -0.06529786  0.5555956867  0.2031317937 -0.050389599  0.73512798 -0.102014559  0.20984151
IL_13          0.27529157  0.3559427297 -0.0399010765  0.265076920 -0.25796332 -0.068927711  0.58942516
IL_16          0.42079000  0.0007224953  0.0832211446 -0.082097273  0.04435883 -0.007094672 -0.06581741
IL_17E        -0.01126118  0.5635958176  0.3744707126  0.302512329 -0.38918707  0.221149380 -0.46462692
IL_1alpha      0.25078195 -0.0687043488 -0.3008366900  0.330945942  0.16992452  0.742391473  0.12787035
IL_3           0.42026485 -0.0703352892 -0.1049647272 -0.065352774  0.02352819 -0.165587911 -0.09006656
IL_4           0.33302031  0.0688495706 -0.1395450144  0.165631691 -0.14268797 -0.297421293  0.19661173
IL_5           0.38706503 -0.0039619980  0.0005616126 -0.224448981  0.08426042  0.153835977 -0.16425757
IL_6           0.05398185 -0.4248425653  0.6090821756  0.417591202 -0.00165066 -0.166089521  0.21895103
IL_6_Receptor  0.21218980  0.1005338329  0.2920341087 -0.659953479 -0.29654048  0.138000448  0.22657846
IL_7           0.32948731  0.0806070090 -0.1966471906  0.165544952  0.11373532 -0.405698338 -0.42065832
IL_8           0.29329723 -0.1883039842  0.4405255221  0.002811187  0.28608600  0.184321013 -0.14833779
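Conceptually, a variance threshold such as thresh = 0.8 means: standardize the variables, run PCA, and keep the smallest number of components whose cumulative variance reaches the threshold. The sketch below does this by hand with prcomp() on the built-in mtcars data. Both the dataset and the exact selection rule are my assumptions for illustration; caret's internal implementation may differ in details.

```r
# Built-in data with 11 numeric variables (illustrative choice)
X <- mtcars

# Center and scale before PCA so every variable contributes equally
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Cumulative share of variance explained
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Smallest number of components reaching the threshold
thresh <- 0.8
n_comp <- which(cum_var >= thresh)[1]

# Rotation (loadings) of the components that are kept
rotation_kept <- pca$rotation[, seq_len(n_comp), drop = FALSE]

print(round(cum_var, 4))
print(n_comp)
```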

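Running princomp as before gives the following. Note that princomp here works on the covariance matrix of the unscaled variables, whereas preProcess() with method = "pca" centers and scales the variables first, so the two do not need the same number of components to reach 80%: in this output the cumulative proportion already passes 80% at the third component.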

summary(princomp(training[, IL_str]))

                          Comp.1    Comp.2    Comp.3     Comp.4    Comp.5     Comp.6     Comp.7     Comp.8     Comp.9    Comp.10      Comp.11
Standard deviation     1.4947237 1.2328654 1.1784414 0.62482551 0.5437811 0.40042942 0.36239787 0.30216932 0.25401803 0.25223488 0.0286080996
Proportion of Variance 0.3523411 0.2397026 0.2190067 0.06156856 0.0466326 0.02528678 0.02071156 0.01439933 0.01017585 0.01003348 0.0001290683
Cumulative Proportion  0.3523411 0.5920437 0.8110505 0.87261902 0.9192516 0.94453840 0.96524995 0.97964928 0.98982513 0.99985861 0.9999876814
                            Comp.12
Standard deviation     8.838105e-03
Proportion of Variance 1.231856e-05
Cumulative Proportion  1.000000e+00

Let's build glm models to see how PCA affects accuracy [Practical Machine Learning HW-2]

set.seed(3433)

library(AppliedPredictiveModeling)

library(caret)  # provides createDataPartition(), train(), confusionMatrix()

data(AlzheimerDisease)

adData = data.frame(diagnosis, predictors)

inTrain = createDataPartition(adData$diagnosis, p = 3/4)[[1]]

training = adData[inTrain, ]

testing = adData[-inTrain, ]

set.seed(3433)

## grep the predictors starting with 'IL'

IL_str <- grep("^IL", colnames(training), value = TRUE)

## make a subset of these predictors

predictors_IL <- predictors[, IL_str]

df <- data.frame(diagnosis, predictors_IL)

inTrain = createDataPartition(df$diagnosis, p = 3/4)[[1]]

training = df[inTrain, ]

testing = df[-inTrain, ]

The glm model without PCA

## train the data using the first method

modelFit <- train(diagnosis ~ ., method = "glm", data = training)

predictions <- predict(modelFit, newdata = testing)

## get the confusion matrix for the first method

C1 <- confusionMatrix(predictions, testing$diagnosis)

print(C1)

Confusion Matrix and Statistics

          Reference
Prediction Impaired Control
  Impaired        2       9
  Control        20      51

               Accuracy : 0.6463
                 95% CI : (0.533, 0.7488)
    No Information Rate : 0.7317
    P-Value [Acc > NIR] : 0.96637

                  Kappa : -0.0702
 Mcnemar's Test P-Value : 0.06332

            Sensitivity : 0.09091
            Specificity : 0.85000
         Pos Pred Value : 0.18182
         Neg Pred Value : 0.71831
             Prevalence : 0.26829
         Detection Rate : 0.02439
   Detection Prevalence : 0.13415
      Balanced Accuracy : 0.47045

       'Positive' Class : Impaired
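The headline statistics in this output follow directly from the four cell counts. As a sketch, they can be recomputed by hand in base R; the variable names are mine.

```r
# Cell counts from the table above: rows = prediction, columns = reference
cm <- matrix(c(2, 20, 9, 51), nrow = 2,
             dimnames = list(Prediction = c("Impaired", "Control"),
                             Reference  = c("Impaired", "Control")))

# Accuracy: share of cases on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity: correctly predicted Impaired among truly Impaired
sensitivity <- cm["Impaired", "Impaired"] / sum(cm[, "Impaired"])

# Specificity: correctly predicted Control among truly Control
specificity <- cm["Control", "Control"] / sum(cm[, "Control"])

# No-information rate: share of the largest true class
nir <- max(colSums(cm)) / sum(cm)

# Balanced accuracy: mean of sensitivity and specificity
balanced <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, nir = nir, balanced = balanced), 5))
```

Note that the accuracy (0.6463) is below the no-information rate (0.7317): this model does worse than always predicting Control.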

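Next, the result with PCA applied. Note that confusionMatrix() expects the predictions first and the reference labels second; the call for C2 below passes them the other way round, so the table is transposed relative to C1 and the per-class statistics (sensitivity, specificity, prevalence, no information rate) are not directly comparable with those of C1. Accuracy is unaffected by the swap.

## accuracy of the first (non-PCA) model

A1 <- C1$overall[1]

## prediction with PCA: train() applies the PCA preprocessing itself

## use the column name in the formula so that "." does not pull diagnosis in as a predictor

modelFit <- train(diagnosis ~ ., method = "glm", preProcess = "pca",

data = training, trControl = trainControl(preProcOptions = list(thresh = 0.8)))

## true labels are passed first here, the reverse of the argument order used for C1

C2 <- confusionMatrix(testing$diagnosis, predict(modelFit, testing))

print(C2)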

Confusion Matrix and Statistics

          Reference
Prediction Impaired Control
  Impaired        3      19
  Control         4      56

               Accuracy : 0.7195
                 95% CI : (0.6094, 0.8132)
    No Information Rate : 0.9146
    P-Value [Acc > NIR] : 1.000000

                  Kappa : 0.0889
 Mcnemar's Test P-Value : 0.003509

            Sensitivity : 0.42857
            Specificity : 0.74667
         Pos Pred Value : 0.13636
         Neg Pred Value : 0.93333
             Prevalence : 0.08537
         Detection Rate : 0.03659
   Detection Prevalence : 0.26829
      Balanced Accuracy : 0.58762

       'Positive' Class : Impaired

Accuracy before PCA: 0.65

Accuracy after PCA: 0.72

So, at least on this dataset, dimensionality reduction also helped improve accuracy.
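The "PCA, then glm" pipeline can also be written out step by step in base R, which makes explicit that the components are learned on the training set only and then applied to the test set. This is a minimal sketch under my own assumptions: it uses the built-in iris data (two species only) instead of the Alzheimer's data, a fixed number of two components instead of a variance threshold, and an arbitrary seed and split.

```r
set.seed(1)  # assumption: arbitrary seed for a repeatable split

# Two-class problem from built-in data (illustrative stand-in)
df <- droplevels(subset(iris, Species != "setosa"))
idx <- sample(nrow(df), size = 75)
train <- df[idx, ]
test <- df[-idx, ]

# Learn centering, scaling and rotation on the training predictors only
pca <- prcomp(train[, 1:4], center = TRUE, scale. = TRUE)

# Project both sets onto the first two components
train_pc <- data.frame(Species = train$Species,
                       predict(pca, train[, 1:4])[, 1:2])
test_pc <- data.frame(predict(pca, test[, 1:4])[, 1:2])

# Logistic regression on the components
fit <- glm(Species ~ PC1 + PC2, data = train_pc, family = binomial)

# Predicted probability of the second factor level, thresholded at 0.5
prob <- predict(fit, newdata = test_pc, type = "response")
pred <- ifelse(prob > 0.5, levels(df$Species)[2], levels(df$Species)[1])

accuracy <- mean(pred == test$Species)
print(accuracy)
```

Whether PCA helps or hurts accuracy depends on the data, so a comparison like the one above is worth repeating on each new problem.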
