[kbdaa_bda] Bank Marketing Data Analysis Practice

YONG_X 2017. 9. 22. 08:36

# Data files with external-environment variables added (bank target marketing)

bnk05.csv

bankfa.csv

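# Load the data from the blog

# stringsAsFactors=TRUE keeps text columns as factors, which the plot() and ctree() calls below rely on (R 4.0 and later read them as character by default)

bnk05 <- read.csv('https://t1.daumcdn.net/cfile/blog/99E6173359C44CE507?download', stringsAsFactors=TRUE)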
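Several later steps, such as plot(bnk05$loan, bnk05$balance) and ctree(), treat text columns like loan as factors. The sketch below shows the difference the stringsAsFactors argument makes. It reads a tiny hypothetical CSV from a string instead of the blog file, so the column values are illustrative only.

```r
# Hypothetical three-row CSV standing in for the downloaded file
csv_text <- "age,loan,balance\n30,yes,100\n41,no,2500\n25,unknown,-50"

# Default behaviour depends on the R version (character on R >= 4.0)
as_default <- read.csv(text = csv_text)

# Explicitly request factors so the result is the same on every version
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_default$loan)
class(as_fct$loan)
levels(as_fct$loan)
```

If the file has already been loaded as character, a single column can be converted afterwards with as.factor().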

Input variables:

# bank client data:

1 - age (numeric)

2 - job : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student", "blue-collar","self-employed","retired","technician","services")

3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)

4 - education (categorical: "unknown","secondary","primary","tertiary")

5 - default: has credit in default? (binary: "yes","no") == default status

6 - balance: average yearly balance, in euros (numeric)

7 - housing: has housing loan? (binary: "yes","no") == whether the client has a mortgage

8 - loan: has personal loan? (binary: "yes","no") == whether the client has a personal loan

# related with the last contact of the current campaign:

(How did the last contact of the current campaign go?)

9 - contact: contact communication type (categorical: "unknown","telephone","cellular")

= Was the campaign call made to a mobile phone, or to a landline such as a home or office phone?

10 - day: last contact day of the month (numeric)

= On which day of the month was the call made?

11 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")

= In which month was the call made?

12 - duration: last contact duration, in seconds (numeric)

= How long did the last call last? (in seconds)

# other attributes:

13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

= How many contacts were made, including the last one?

14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)

= How many days have passed since the last contact in the previous campaign?

15 - previous: number of contacts performed before this campaign and for this client (numeric)

= How many times was the client targeted or contacted in earlier campaigns?

16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

= What was the outcome of the previous marketing campaign, not the current one?

Output variable (desired target):

17 - y - has the client subscribed a term deposit? (binary: "yes","no")

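External-environment variables:

emp.var.rate = employment variation rate

cons.price.idx = consumer price index

cons.conf.idx = consumer confidence index

euribor3m = Euribor 3-month rate

nr.employed = number of employees (employed persons)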
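The description of pdays above says that -1 is a sentinel meaning "not previously contacted", so averaging the raw column would mix real day counts with the sentinel. A minimal sketch of one way to handle this, assuming the loaded file follows that description. The toy data frame is hypothetical and stands in for bnk05.

```r
# Hypothetical toy data; the real column would be bnk05$pdays
toy <- data.frame(pdays = c(-1, 5, -1, 180, 30))

# Flag clients who were never contacted in a previous campaign
toy$never_contacted <- toy$pdays == -1

# Replace the sentinel with NA so that summaries skip it
toy$pdays_clean <- ifelse(toy$pdays == -1, NA, toy$pdays)

# Mean days since last contact, among previously contacted clients only
mean(toy$pdays_clean, na.rm = TRUE)
```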

# Build a tree to see whether loan status differs by age,

# and which age bands take out more loans

library(party)

t1 <- ctree(loan~age, data=bnk05)

plot(t1)

[Analysis Topic 1]

Identify what characteristics the people who took out a loan

have in common

- Exploratory data visualization (plot, etc.)

- Decision tree analysis

- Cluster analysis
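The post does not show code for the cluster-analysis step listed above. One possible starting point is k-means on standardized numeric columns. This is only a sketch under assumptions: the toy data frame is hypothetical, and the choice of age and balance as inputs and of three clusters is arbitrary.

```r
set.seed(1)  # k-means starts from random centers

# Hypothetical toy data standing in for bnk05[, c("age", "balance")]
toy <- data.frame(
  age     = c(23, 25, 31, 45, 47, 52, 60, 62),
  balance = c(200, 150, 900, 3000, 2800, 500, 7000, 6500)
)

# Standardize so that balance (large numbers) does not dominate the distance
scaled <- scale(toy)

# Three clusters, best of 10 random starts
km <- kmeans(scaled, centers = 3, nstart = 10)
toy$cluster <- km$cluster

# Mean profile of each cluster on the original scale
aggregate(cbind(age, balance) ~ cluster, data = toy, FUN = mean)
```

On the real data, the share of loan == "yes" within each cluster could then be compared with table(toy$cluster, loan).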

# Does the balance differ by loan status?

plot(bnk05$loan, bnk05$balance)

# Many extreme values make the difference hard to see

# Restrict the range to +/- 1,000 euros and look again

plot(bnk05$loan, bnk05$balance, ylim=c(-1000,1000))
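Because a few extreme balances stretch the plot, a numeric summary per group can complement it. The sketch below compares the median, which is robust to outliers, with the mean for each loan status. The toy values are hypothetical and chosen so that one outlier distorts the mean.

```r
# Hypothetical toy data standing in for bnk05[, c("loan", "balance")]
toy <- data.frame(
  loan    = factor(c("no", "no", "no", "yes", "yes", "yes")),
  balance = c(100, 500, 60000, -200, 300, 400)
)

# Median balance per loan status (barely affected by the 60000 outlier)
med <- tapply(toy$balance, toy$loan, median)

# Mean balance per loan status (pulled up strongly by the outlier)
avg <- tapply(toy$balance, toy$loan, mean)

rbind(median = med, mean = avg)
```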

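# Hypothesis: among people in their 40s, those with a high balance take out more loans

# To check it,

# first find a cutoff for calling a balance "high"

# What does the overall balance distribution look like?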

plot(bnk05$balance)

plot(sort(bnk05$balance))

# What do the quartiles (25% steps) look like?

quantile(bnk05$balance)

0% 25% 50% 75% 100%

-1137 108 565 1476 64343

# How many people have a balance of 10,000 euros or more? And 5,000 or more?

nrow(bnk05[bnk05$balance>=10000, ])

nrow(bnk05[bnk05$balance>=5000, ])

# How many people have 2,000 euros or more?

nrow(bnk05[bnk05$balance>=2000, ])

# To turn these into proportions, how many people are there in total?

nrow(bnk05)
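The counts above can be turned into proportions in one step, because the mean of a logical vector is the share of TRUE values. A small sketch with hypothetical balances (the real input would be bnk05$balance):

```r
# Hypothetical balances standing in for bnk05$balance
toy <- data.frame(balance = c(-50, 120, 800, 2000, 2500, 12000, 300, 5200))

# Candidate cutoffs taken from the steps above
thresholds <- c(2000, 5000, 10000)

# Share of people at or above each cutoff
share <- sapply(thresholds, function(th) mean(toy$balance >= th))
names(share) <- thresholds
share
```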

# Add a flag column marking people with a balance of 2,000 euros or more

bnk05$balance_high <- ifelse(bnk05$balance>=2000, "Y", "N")

# Check how many high-balance people there are

table(bnk05$balance_high)

plot(as.factor(bnk05$balance_high))

# Create a flag for people in their 40s

bnk05$is_age_40s <- ifelse(bnk05$age>=40 & bnk05$age<50, "Y", "N")

# Head counts by 40s / not 40s and by high / low balance

table(bnk05$is_age_40s, bnk05$balance_high)

plot(table(bnk05$is_age_40s, bnk05$balance_high))
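Raw counts are hard to compare when the two age groups differ in size. Row proportions from prop.table() show the share of high-balance people within each group. The toy flags below are hypothetical stand-ins for the is_age_40s and balance_high columns.

```r
# Hypothetical flags standing in for bnk05$is_age_40s and bnk05$balance_high
toy <- data.frame(
  is_age_40s   = c("Y", "Y", "Y", "Y", "N", "N", "N", "N", "N", "N"),
  balance_high = c("Y", "Y", "N", "N", "Y", "N", "N", "N", "N", "N")
)

# Cross-tabulate: rows are the age flag, columns are the balance flag
tab <- table(toy$is_age_40s, toy$balance_high)

# margin = 1 makes each row sum to 1
row_share <- prop.table(tab, margin = 1)
round(row_share, 3)
```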

# Flag people who are in their 40s and have a high balance (is_40HB = 40s & High Balance)

bnk05$is_40HB <- ifelse(bnk05$is_age_40s =="Y" & bnk05$balance_high=="Y" , "Y", "N")

# Age groups

bnk05$age_grp <- ifelse(bnk05$age<20, "under 20", "" )

table(bnk05$age_grp)

bnk05$age_grp <- ifelse(bnk05$age>=20 & bnk05$age<=40, "20-40", bnk05$age_grp )

table(bnk05$age_grp)

bnk05$age_grp <- ifelse(bnk05$age>40, "over 40", bnk05$age_grp )

barplot(table(bnk05$age_grp))
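The three chained ifelse() calls above can also be written as a single cut() call, which returns an ordered set of factor levels. This is an alternative sketch, not the post's method. It assumes ages are whole numbers, so that "under 20" is the same as "19 or less". The toy ages are hypothetical.

```r
# Hypothetical ages standing in for bnk05$age
toy <- data.frame(age = c(18, 19, 20, 33, 40, 41, 67))

# Intervals are right-closed by default: (-Inf,19], (19,40], (40,Inf]
toy$age_grp <- cut(toy$age,
                   breaks = c(-Inf, 19, 40, Inf),
                   labels = c("under 20", "20-40", "over 40"))

table(toy$age_grp)
```

Because the result is a factor, the bars in barplot(table(...)) follow the level order rather than alphabetical order.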

# Compare the distribution of loan status across the age groups

counts <- table(bnk05$loan, bnk05$age_grp)

barplot(counts, col=c("red", "darkblue", "lightgreen"),

legend = rownames(counts),

beside=TRUE)

# To pull only management and look at age and job only

head(bnk05[bnk05$job=="management",c("age", "job")])

# Extract only blue-collar people who have a loan

bnk051 <-bnk05[ bnk05$job=="blue-collar" & bnk05$loan=="yes", ]

head(bnk051[,c("job", "loan", "age")])

# What if age and job are considered together?

# Use a tree

t1 <- ctree(loan~age+job, data=bnk05)

plot(t1)

# While drawing the tree, we wanted to drop the 'unknown' category (the code below removes the rows where loan=='unknown')

bnk07 <- bnk05[bnk05$loan!='unknown',]

bnk07$loan <- as.character(bnk07$loan)

bnk07$loan <- as.factor(bnk07$loan)

t2 <- ctree(loan~age+job, data=bnk07)

plot(t2)
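The as.character() / as.factor() round trip above removes the now-empty "unknown" level. Base R's droplevels() does the same for every factor column in one call. A small sketch on a hypothetical data frame:

```r
# Hypothetical toy data standing in for bnk05
toy <- data.frame(
  loan = factor(c("no", "yes", "unknown", "no", "yes")),
  age  = c(30, 41, 25, 52, 38)
)

# Filtering alone keeps "unknown" as an empty level
filtered <- toy[toy$loan != "unknown", ]
levels(filtered$loan)

# droplevels() removes unused levels from all factor columns
kept <- droplevels(filtered)
levels(kept$loan)
```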

# Look at variable importance with a random forest

library(randomForest)

rf1 <- randomForest(loan ~ age+job+education+marital+y+housing, data=bnk07, do.trace=50, ntree=200, importance=T)

plot(rf1)

varImpPlot(rf1)

#--- Compare which jobs have a high loan rate -----

# First get head counts by job and loan status

a <- table(bnk07$job, bnk07$loan)

b <- as.data.frame(a)

names(b) <- c("job","loan","Freq")

head(b)

c <- b[b$loan=="yes",c("job","Freq")]

# Total head count per job

d <- table(bnk07$job)

# Combine, then compute the rate

c$Freq_sum <- d

c$rate_loan <- c$Freq/c$Freq_sum

# Check visually

barplot(c$rate_loan, cex.names=0.6, col=ifelse(c$rate_loan>=0.2,"red", "lightblue"),

main="Share of borrowers by job")
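The same loan rate per job can be obtained more directly from row proportions of the job-by-loan table. This is an alternative sketch rather than the post's approach, and the toy data frame is hypothetical.

```r
# Hypothetical toy data standing in for bnk07[, c("job", "loan")]
toy <- data.frame(
  job  = factor(c("admin.", "admin.", "admin.", "admin.",
                  "services", "services", "student", "student")),
  loan = factor(c("yes", "no", "no", "no", "yes", "yes", "no", "no"))
)

# Row proportions: within each job, the share of each loan status
rate_loan <- prop.table(table(toy$job, toy$loan), margin = 1)[, "yes"]

# Highest loan rate first
rate_loan <- sort(rate_loan, decreasing = TRUE)
rate_loan
```

The named vector can be passed straight to barplot(), which then labels each bar with its job.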

[Analysis Topic 2]

Which group is the most attractive among customers in their 20s?

# Extract only people in their 20s

a20s <- bnk05[bnk05$age>=20 & bnk05$age<30,]

# How many are there?

nrow(a20s)

# Look at the job distribution

plot(a20s$job)

# Shrink the job labels so that all of them are shown

plot(a20s$job, cex.names=0.6)

# Mean balance of customers in their 20s, by marital status

agg1 <-aggregate(a20s$balance, by=list(a20s$marital),

FUN=mean, na.rm=TRUE)

plot(agg1, main="mean balance by marital status of 20 something")
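A group mean can be misleading when a group has very few members, which may happen for some marital statuses among people in their 20s. The sketch below puts the group size next to each mean. The toy data frame is hypothetical and stands in for a20s.

```r
# Hypothetical toy data standing in for a20s[, c("marital", "balance")]
toy <- data.frame(
  marital = factor(c("single", "single", "single",
                     "married", "married", "divorced")),
  balance = c(100, 300, 500, 1000, 2000, 50)
)

# Mean balance per marital status (formula interface keeps column names)
agg <- aggregate(balance ~ marital, data = toy, FUN = mean)

# Attach the number of people behind each mean
agg$n <- as.vector(table(toy$marital)[as.character(agg$marital)])
agg
```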

# Even within the 20s, does taking a loan (housing or personal)

# depend on the exact age and on how much balance the person holds?

plot(jitter(a20s$age), a20s$balance, col=ifelse(a20s$housing=='yes' | a20s$loan=='yes', "red", "grey"))

[Analysis Topic 3]

We want to develop a new loan product for one job type.

Which job group would you choose?

# Mean balance by job

# Aggregate first

agg1 <-aggregate(bnk05$balance, by=list(bnk05$job),

FUN=mean, na.rm=TRUE)

# Rename the columns

names(agg1) <- c("job", "mean_bal")

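# View as a chart, sorted by mean balance (the labels are reordered together with the bars)

agg1 <- agg1[order(agg1$mean_bal),]

barplot(agg1$mean_bal, names.arg=as.character(agg1$job), cex.names=0.5)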

[Exercise: Analysis Topic 4]

What characteristics do people with a high balance have in common?

- Which variable is most strongly related to balance?

plot(bnk05$marital)

plot(as.numeric(bnk05$marital), bnk05$balance)

# Convert the factor marital to numeric codes, then

# jitter() scatters the points slightly

# cex controls the point size

plot(jitter(as.numeric(bnk05$marital)), jitter(bnk05$balance), cex=0.5, ylim=c(0,30000))
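One rough way to approach the exercise question ("which variable is most strongly related to balance?") is to fit a one-predictor linear model per candidate variable and compare R-squared. This is only a sketch under assumptions: it captures linear or group-mean differences only, and the toy data frame and its columns are hypothetical stand-ins for bnk05.

```r
# Hypothetical toy data standing in for bnk05
toy <- data.frame(
  balance = c(100, 200, 400, 800, 1500, 3000, 3200, 5000),
  age     = c(22, 25, 31, 38, 44, 51, 55, 63),
  marital = factor(c("single", "single", "married", "single",
                     "married", "married", "divorced", "divorced")),
  housing = factor(c("yes", "no", "yes", "no", "yes", "no", "yes", "no"))
)

# Every column except the response is a candidate predictor
predictors <- setdiff(names(toy), "balance")

# R-squared of balance ~ <one predictor>, for each predictor
r2 <- sapply(predictors, function(v) {
  fit <- lm(reformulate(v, response = "balance"), data = toy)
  summary(fit)$r.squared
})

sort(r2, decreasing = TRUE)
```

With skewed balances such as the ones in this post, a log transform or a tree-based importance measure may give a different ranking.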

# Balance analysis using decision trees

install.packages('party')

library(party)

# Age and balance

t1 <- ctree(balance~age, data=bnk05)

plot(t1)

# Marital status and balance

t2 <- ctree(balance~marital, data=bnk05)

plot(t2)

# Balance when age and marital status are considered together

t3 <- ctree(balance~marital+age, data=bnk05)

plot(t3)

# Decision tree result for separating balance levels

Attribution, NonCommercial, NoDerivatives
