[R Analysis: DT] Building Tree Models with rpart


YONG_X 2017. 5. 29. 14:47

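# rpart is a slightly modified implementation of classic CART

# (the name stands for Recursive Partitioning and Regression Trees)

# For a class (categorical) target it splits on the Gini index;

# for a continuous target it builds a regression tree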

data("USArrests")

# Statistics about violent crime rates by US state.

# Murder: Murder arrests (per 100,000)

# Assault: Assault arrests (per 100,000)

# UrbanPop: Percent urban population (used as the target)

# Rape: Rape arrests (per 100,000)

head(USArrests)

nrow(USArrests) # 50

# install.packages("rpart")

library(rpart) #load the rpart package

t1 <- rpart(UrbanPop ~ ., data = USArrests)

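# The method argument selects the splitting rule: "anova", "poisson", "exp" or "class". When it is omitted, rpart picks one from the type of the target (here "anova").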
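
# A minimal sketch of the point above, assuming only the rpart package and the
# built-in USArrests data. The names t1_default and t1_anova are chosen for this
# illustration and are not part of the original script. It checks that leaving
# method out for a numeric target gives the same tree as method = "anova".

```r
library(rpart)
data("USArrests")

# Default call: rpart infers the method from the numeric target.
t1_default <- rpart(UrbanPop ~ ., data = USArrests)

# Same model with the method spelled out.
t1_anova <- rpart(UrbanPop ~ ., data = USArrests, method = "anova")

t1_default$method                             # method rpart chose
all.equal(t1_default$frame, t1_anova$frame)   # TRUE when both trees match
```

# The other method values ("poisson", "exp", "class") expect a different kind
# of target, so they are not interchangeable on this data set.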

#------- t1 output: print(t1) --------------
# n= 50
#
# node), split, n, deviance, yval
#       * denotes terminal node
#
#  1) root 50 10266.4200 65.54000
#    2) Rape< 17.55 21 4368.9520 58.04762
#      4) Assault< 94 9 1100.0000 50.66667 *
#      5) Assault>=94 12 2410.9170 63.58333 *
#    3) Rape>=17.55 29 3864.9660 70.96552
#      6) Murder>=12.45 7 847.7143 64.42857 *
#      7) Murder< 12.45 22 2622.9550 73.04545
#       14) Rape< 21.2 7 1018.8570 69.14286 *
#       15) Rape>=21.2 15 1447.7330 74.86667 *

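# Columns: node number) ; split rule ; n = number of observations in the node ;

# deviance = sum of squared deviations from the node mean (SSE, not the variance) ; yval = mean of the target in the node

# * marks a terminal (leaf) node

# The target is continuous, so rpart fits a regression tree. Each node shows the mean of the target within that node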
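
# To make the deviance and yval columns concrete, here is a small sketch that
# recomputes the root-node values by hand. The helper names root_sse and
# root_mean are made up for this illustration.

```r
library(rpart)
data("USArrests")

t1 <- rpart(UrbanPop ~ ., data = USArrests)

y <- USArrests$UrbanPop
root_sse  <- sum((y - mean(y))^2)   # sum of squared deviations from the mean
root_mean <- mean(y)

# Row 1 of the frame is the root node.
c(deviance = t1$frame$dev[1], by_hand = root_sse)
c(yval = t1$frame$yval[1], by_hand = root_mean)

# The variance is the SSE divided by n - 1, so it is a different number.
var(y)
root_sse / (length(y) - 1)
```

# The by-hand values should match the root line of the printed tree.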

hist(USArrests$UrbanPop) # check the distribution of the target

plot(sort(USArrests$UrbanPop))

# Draw tree plots

# install.packages("rattle")

# install.packages("rpart.plot")

library(rattle)

library(rpart.plot)

library(RColorBrewer)

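rpart.plot(t1) # basic tree plot (rpart.plot package)

fancyRpartPlot(t1) # cleaner, pastel-coloured tree plot (rattle package)

# Visual analysis -- scatterplot

plot(USArrests$Rape, USArrests$Murder,
     col = ifelse(USArrests$UrbanPop > median(USArrests$UrbanPop),
                  "blue", "lightblue"),
     pch = 19)

# Dark blue = UrbanPop above the median. States with a high rape rate relative to their murder rate tend to have a larger urban population share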
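
# The reading of the scatterplot can be cross-checked numerically. This is only
# a rough sketch: the ratio Rape / Murder and the names group, rape_per_murder
# and ratio_by_group are assumptions made for this illustration, not something
# the original analysis computes.

```r
data("USArrests")

# Same grouping as the plot colours: above vs. at-or-below the median UrbanPop.
group <- ifelse(USArrests$UrbanPop > median(USArrests$UrbanPop),
                "above median", "at or below median")

# Rape arrests per murder arrest, state by state.
rape_per_murder <- USArrests$Rape / USArrests$Murder

# Median ratio within each group; a higher value in the "above median"
# group would agree with the reading of the scatterplot.
ratio_by_group <- tapply(rape_per_murder, group, median)
ratio_by_group
```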

#----- Convert to an example with a class (categorical) target ----

USA1 <- USArrests

USA1$UrbanPop1 <- ifelse(USA1$UrbanPop> median(USArrests$UrbanPop), "Large", "Small")

USA1 <- USA1[, names(USA1)!="UrbanPop"] # drop UrbanPop

t2 <- rpart(UrbanPop1 ~ ., data = USA1, method = "class")

#----- t2 output: print(t2) ------------
# n= 50
#
# node), split, n, loss, yval, (yprob)
#       * denotes terminal node
#
#  1) root 50 24 Small (0.4800000 0.5200000)
#    2) Rape>=22.7 18 3 Large (0.8333333 0.1666667) *
#    3) Rape< 22.7 32 9 Small (0.2812500 0.7187500)
#      6) Murder>=3 24 9 Small (0.3750000 0.6250000)
#       12) Murder< 7.95 16 7 Large (0.5625000 0.4375000) *
#       13) Murder>=7.95 8 0 Small (0.0000000 1.0000000) *
#      7) Murder< 3 8 0 Small (0.0000000 1.0000000) *

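# loss = number of misclassified observations in the node (with the default 0/1 loss)

# yval = predicted class of the node (Large or Small)

# yprob = class proportions in the node, in the order (Large, Small)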
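
# The root line of the output can be reproduced from the class counts alone.
# A minimal sketch: factor() is added here (an assumption, not in the original
# code) so that the level order Large, Small is explicit, and class_counts and
# root_loss are names made up for this illustration.

```r
library(rpart)
data("USArrests")

USA1 <- USArrests
USA1$UrbanPop1 <- factor(ifelse(USA1$UrbanPop > median(USA1$UrbanPop),
                                "Large", "Small"))
USA1$UrbanPop <- NULL   # drop the original numeric target

t2 <- rpart(UrbanPop1 ~ ., data = USA1, method = "class")

class_counts <- table(USA1$UrbanPop1)   # observations per class
class_counts
prop.table(class_counts)                # root yprob

# Root loss: observations that do not belong to the majority class.
root_loss <- sum(class_counts) - max(class_counts)
c(loss = t2$frame$dev[1], by_hand = root_loss)
```

# The same counting applies inside every node, restricted to the observations
# that reach that node.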

fancyRpartPlot(t2)

# To split on information gain instead of the default Gini index, pass parms = list(split = 'information').

# The minsplit and minbucket options can be set in the same call.

# t2 <- rpart(UrbanPop1 ~ ., data = USA1, method = "class",
#             parms = list(split = 'information'), minsplit = 2, minbucket = 1)

# ref :: https://gormanalysis.com/decision-trees-in-r-using-rpart/
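
# A runnable version of the commented-out call, as a sketch. It stores the
# result in t2_info (a name chosen here so that t2 stays untouched) and passes
# the limits through rpart.control(), which is equivalent to giving minsplit
# and minbucket directly to rpart().

```r
library(rpart)
data("USArrests")

USA1 <- USArrests
USA1$UrbanPop1 <- factor(ifelse(USA1$UrbanPop > median(USA1$UrbanPop),
                                "Large", "Small"))
USA1$UrbanPop <- NULL

# Information-gain splits, with very permissive node-size limits.
t2_info <- rpart(UrbanPop1 ~ ., data = USA1, method = "class",
                 parms = list(split = "information"),
                 control = rpart.control(minsplit = 2, minbucket = 1))

print(t2_info)

# The stored control list shows the settings that were actually used.
t2_info$control[c("minsplit", "minbucket")]
```

# With minsplit = 2 and minbucket = 1 the tree is allowed to grow down to
# single-observation leaves, so it will usually need pruning afterwards.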

# Print the complexity parameter (CP) table used to choose the pruning level

printcp(t2)

#---- output ----------------------
# Classification tree:
# rpart(formula = UrbanPop1 ~ ., data = USA1, method = "class")
#
# Variables actually used in tree construction:
# [1] Murder Rape
#
# Root node error: 24/50 = 0.48
#
# n= 50
#
#         CP nsplit rel error  xerror    xstd
# 1 0.500000      0   1.00000 1.33333 0.14142
# 2 0.041667      1   0.50000 0.54167 0.12923
# 3 0.010000      3   0.41667 0.58333 0.13229

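# Prune the rpart tree

# In the printcp() output, xerror (the cross-validation error) is lowest in the row with CP = 0.041667 (nsplit = 1), so prune with a cp slightly above that value to keep the one-split tree. xerror comes from random cross-validation folds, so it can differ between runs.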

t2 <- prune(t2, cp=.05)

fancyRpartPlot(t2)
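
# Instead of reading the cp off the table by eye, it can be picked from the
# cptable component of the fitted tree. This is a sketch of one common rule
# (take the row with the lowest xerror), not the only reasonable one. The seed
# value and the names t2_full, cp_table, best_row, best_cp and t2_pruned are
# assumptions made for this illustration.

```r
library(rpart)
data("USArrests")

USA1 <- USArrests
USA1$UrbanPop1 <- factor(ifelse(USA1$UrbanPop > median(USA1$UrbanPop),
                                "Large", "Small"))
USA1$UrbanPop <- NULL

set.seed(1)  # arbitrary seed: xerror depends on random cross-validation folds
t2_full <- rpart(UrbanPop1 ~ ., data = USA1, method = "class")

cp_table <- t2_full$cptable
best_row <- which.min(cp_table[, "xerror"])   # row with the lowest xerror
best_cp  <- cp_table[best_row, "CP"]

# Prune back to the tree that corresponds to the selected row.
t2_pruned <- prune(t2_full, cp = best_cp)
printcp(t2_pruned)
```

# Because xerror changes with the folds, the selected row may differ from the
# table shown earlier when another seed is used.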

License: Attribution, NonCommercial, NoDerivatives
