RE::VISION CRM


Python: Practical Tips for EDA [전용준. 리비젼컨설팅]

YONG_X 2019. 8. 27. 14:50


A few (entirely subjective...) tips for doing exploratory data analysis in practice

The key points ....

- [1] Use one familiar library intensively (esp. plt.scatter)

- [2] If there are many variables, focus on the key variables first (esp. when there is a Target)

- [3] Work with three variables as one set (and no more than three variables; and no 3D plots)


__Plus__

- Whether a plot looks pretty does not matter in practice

- Early EDA is for analysis, not for reporting

- The goal is to find stark patterns quickly



[ #머신러닝 #EDA #파이썬 #전용준 #리비젼 #리비젼컨설팅 

#탐색적데이터분석 #데이터분석 #python #visualization

#scatter #decisiontree #catboost #variableimportance #변수중요도 ]








[ YouTube video ]




How I Do EDA :: Retail Customer Analysis -- BuyIt.com

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from numpy.polynomial.polynomial import polyfit
import matplotlib.style as style 
from IPython.display import Image
import warnings
warnings.filterwarnings('ignore')

# local data path (not used below; the practice data is loaded from a URL)
dataPath = 'C:/YONG/m1710/myPydata/'

# random jitter: Gaussian noise of ~1% of the variable's range,
# so that heavily overlapping points separate visually
def rjitt(arr):
    stdev = .01*(max(arr)-min(arr))
    return arr + np.random.randn(len(arr)) * stdev

# stronger jitter (~3.1% of the range)
def rjitt2(arr):
    stdev = .031*(max(arr)-min(arr))
    return arr + np.random.randn(len(arr)) * stdev

# min-max scaler to [0, 1]
def mnmx_scl(vec):
    vec = (vec-vec.min())/(vec.max()-vec.min())
    return(vec)

# min-max scaler after capping values at the 95th percentile
def mnmx_scl2(vec):
    vec = np.where(np.percentile(vec, 95) < vec,
                   np.percentile(vec, 95), vec)
    vec = np.array(vec)
    vec = (vec-vec.min())/(vec.max()-vec.min())
    return(vec)

A Practitioner's Tips for EDA - How I Do EDA


[전용준. 리비젼컨설팅. Machine Learning]

- [1] Use one familiar library intensively (esp. plt.scatter)
- [2] If there are many variables, focus on the key variables first (esp. when there is a Target)
- [3] Work with three variables as one set (and no more than three variables; and no 3D plots)

In [2]:
# load the practice data (CSV format) from the blog
dff01 = pd.read_csv('https://t1.daumcdn.net/cfile/blog/992CFF3B5D5CC70C2C?download')
print('Shape of the dataset : ', dff01.shape)
dff01.head()
Shape of the dataset :  (5000, 5)
Out[2]:
[ dff01.head() : first 5 rows -- columns: age, height, weight, amt_strbk, amt_book ]

[1] Use one familiar library intensively (esp. plt.scatter() )

In [3]:
# using plt.scatter from matplotlib
plt.scatter(dff01.amt_strbk, dff01.amt_book)
plt.xlabel('AMT STARBUCKS')
plt.ylabel('AMT BOOK')
plt.show()
In [4]:
# use the alpha (transparency) option
plt.scatter(dff01.amt_strbk, dff01.amt_book, s=5, alpha=0.2)
plt.xlabel('AMT STARBUCKS')
plt.ylabel('AMT BOOK')
plt.show()
In [5]:
# plt.scatter again, now with random jitter applied
plt.scatter(rjitt(dff01.amt_strbk), rjitt(dff01.amt_book), s=5, alpha=0.2)
plt.xlabel('AMT STARBUCKS')
plt.ylabel('AMT BOOK')
plt.show()
In [6]:
# if there is a target, show it (here: color by age group)
median_age = dff01.age.median()
colors1 = ['red' if x>median_age else 'blue' for x in dff01.age]
plt.scatter(rjitt(dff01.amt_strbk), rjitt(dff01.amt_book), 
            s=5, alpha=0.2,
           color=colors1)
plt.xlabel('AMT STARBUCKS')
plt.ylabel('AMT BOOK')
plt.title('blue: young')
plt.show()

- matplotlib, seaborn, and pandas are all fine. The point is that there is no need to dither between them, wasting time on Google searches.
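
For instance, the plot from In [4] is a one-liner in any of the three (a minimal sketch assuming dff01 from In [2]; seaborn is imported here only for the comparison):

import seaborn as sns

# pandas: scatter straight off the DataFrame
dff01.plot.scatter(x='amt_strbk', y='amt_book', s=5, alpha=0.2)
plt.show()

# seaborn: the same scatter in one call
sns.scatterplot(data=dff01, x='amt_strbk', y='amt_book', s=5, alpha=0.2)
plt.show()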

In [7]:
# reference lines contribute a lot to quick visual judgment
median_age = dff01.age.median()
colors1 = ['red' if x>median_age else 'blue' for x in dff01.age]
plt.scatter(rjitt(dff01.amt_strbk), rjitt(dff01.amt_book), 
            s=5, alpha=0.2,
           color=colors1)
plt.xlabel('AMT STARBUCKS')
plt.ylabel('AMT BOOK')
plt.title('blue: young ; orange line:20%')
plt.axvline(np.percentile(dff01.amt_strbk,80), 
            color='orange',linestyle=':')
plt.axhline(np.percentile(dff01.amt_book,80), 
            color='orange',linestyle=':')
plt.show()

Findings :


- Many customers have no book purchases at all
- Most customers have at least some Starbucks spend
- On the younger side, both Starbucks and book spend are often high
- Among customers who bought both items, the two seem positively correlated
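
These eyeballed findings can be backed with a few quick numbers (a minimal sketch, assuming dff01 as loaded above):

# share of customers with no book purchases at all
print('no book purchases :', (dff01.amt_book == 0).mean())
# share of customers with at least some Starbucks spend
print('some starbucks    :', (dff01.amt_strbk > 0).mean())
# correlation between the two amounts among customers who bought both
both = dff01[(dff01.amt_book > 0) & (dff01.amt_strbk > 0)]
print('corr where both>0 :', round(both.amt_strbk.corr(both.amt_book), 3))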

[2] If there are many variables, focus on the key variables first (esp. when there is a Target)


- If there are many variables, get variable importance from a DT, RF, or CatBoost (it is not a __model__ yet; just an auxiliary tool for exploring the variables)
- Using RF means the hassle of encoding categorical variables, so __CatBoost__ is more convenient (see the sketch below)
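
As a sketch of that convenience: CatBoost takes raw categorical columns directly through cat_features, with no encoding step. (Hypothetical example -- the practice dataset above has no categorical column, so 'segment' below is made up purely for illustration.)

from catboost import CatBoostRegressor

df_cat = dff01.copy()
# made-up categorical column, for illustration only
df_cat['segment'] = np.random.choice(['mobile', 'web', 'store'], size=len(df_cat))

features = ['height', 'weight', 'amt_strbk', 'amt_book', 'segment']
cb = CatBoostRegressor(iterations=100, depth=3, learning_rate=0.05,
                       random_seed=42, verbose=False)
# categorical columns are passed by name (or index) as-is
cb.fit(df_cat[features], df_cat['age'], cat_features=['segment'])
print(dict(zip(features, cb.feature_importances_)))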

In [8]:
# searching for key variables using DT
from sklearn.tree import DecisionTreeRegressor
dt1 = DecisionTreeRegressor(min_samples_split=50, max_depth=3, min_samples_leaf=10, random_state=99)
dtfeatures = ['height','weight','amt_strbk','amt_book']
tgt = 'age'
dfdt = dff01
dt1.fit(dfdt[dtfeatures], dfdt[tgt])
# the GraphViz executable path must be set beforehand,
# or [ InvocationException: GraphViz's executables not found ] occurs
import os     
os.environ["PATH"] += os.pathsep + 'C:\\Program Files\\Anaconda3\\Library\\bin\\graphviz'
from sklearn import tree
from IPython.display import Image  
import pydotplus
import graphviz
print('variable importance : ', dict(zip(dfdt[dtfeatures].columns, dt1.feature_importances_)))
# Create DOT data
dot_data = tree.export_graphviz(dt1, out_file=None, 
                                feature_names=dtfeatures)
# Draw graph
graph = pydotplus.graph_from_dot_data(dot_data)  
# Show graph
Image(graph.create_png())
variable importance :  {'amt_strbk': 0.07748767837764613, 'amt_book': 0.19937549317012487, 'weight': 0.0, 'height': 0.7231368284522289}
Out[8]:
[ decision tree diagram ]
In [9]:
# searching for key variables
# a deeper, bushier tree can show more info on variable importance
from sklearn.tree import DecisionTreeRegressor
dt1 = DecisionTreeRegressor(min_samples_split=30, max_depth=5, 
                            min_samples_leaf=5, random_state=99)
dtfeatures = ['height','weight','amt_strbk','amt_book']
tgt = 'age'
dfdt = dff01
dt1.fit(dfdt[dtfeatures], dfdt[tgt])
# the GraphViz executable path must be set beforehand,
# or [ InvocationException: GraphViz's executables not found ] occurs
import os     
os.environ["PATH"] += os.pathsep + 'C:\\Program Files\\Anaconda3\\Library\\bin\\graphviz'
from sklearn import tree
from IPython.display import Image  
import pydotplus
import graphviz
print('variable importance : ', dict(zip(dfdt[dtfeatures].columns, dt1.feature_importances_)))
# Create DOT data
dot_data = tree.export_graphviz(dt1, out_file=None, 
                                feature_names=dtfeatures)
# Draw graph
graph = pydotplus.graph_from_dot_data(dot_data)  
# Show graph
Image(graph.create_png())
variable importance :  {'amt_strbk': 0.10799550290753211, 'amt_book': 0.22525717355765157, 'weight': 0.036030155469811656, 'height': 0.6307171680650047}
Out[9]:
[ decision tree diagram ]
In [10]:
d = dict(zip(dfdt[dtfeatures].columns, dt1.feature_importances_))
d.items()
ddf = pd.DataFrame(pd.Series(d, name='Importance')).reset_index().sort_values('Importance')
ddf.columns = ['VarName', 'Importance']
ddf
plt.barh(ddf['VarName'], ddf['Importance'], color='navy')
plt.title('Features Predicting AGE')
plt.xlabel('Relative Feature Importance from DT')
plt.show()

- Because __CatBoost__ builds many trees, it gives more stable (= more trustworthy) results (see the quick check below)
- CatBoost also gives variables other than the one or two key ones a reasonably fair share of importance.
(This helps keep the influence of the key variables from being overrated. It does, however, take a bit more resources to run.)
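
How much a single tree's importances wobble is easy to check: refit the same DecisionTreeRegressor on a few bootstrap resamples and watch the numbers move. A minimal sketch, assuming dfdt, dtfeatures and tgt from the cells above (averaging over many such trees is what RF/CatBoost effectively does):

from sklearn.tree import DecisionTreeRegressor

# the same single tree, refit on bootstrap resamples of the data
for seed in range(3):
    boot = dfdt.sample(frac=1.0, replace=True, random_state=seed)
    dt = DecisionTreeRegressor(min_samples_split=30, max_depth=5,
                               min_samples_leaf=5, random_state=99)
    dt.fit(boot[dtfeatures], boot[tgt])
    print(seed, dict(zip(dtfeatures, np.round(dt.feature_importances_, 3))))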

In [11]:
from catboost import CatBoostRegressor
model = CatBoostRegressor(iterations=300, depth=3, learning_rate=0.01,
                           eval_metric='R2', 
                           use_best_model=True,
                           random_seed=42)
train_x = dfdt[dtfeatures] 
tgt = 'age'
train_y = dfdt[tgt]
model.fit(
    train_x, train_y,
    # cat_features=categorical_features_indices,
    # verbose=True,  # you can uncomment this for text output
    # plot=True,  # interactive training plot (not used here)
    eval_set=(train_x, train_y)    
)
0:	learn: -0.0000008	test: -0.0000008	best: -0.0000008 (0)	total: 86.7ms	remaining: 25.9s
1:	learn: 0.0000145	test: 0.0000145	best: 0.0000145 (1)	total: 104ms	remaining: 15.6s
2:	learn: 0.0000157	test: 0.0000157	best: 0.0000157 (2)	total: 124ms	remaining: 12.3s
:
: (training log truncated)
:
298:	learn: 0.0982952	test: 0.0982952	best: 0.0982952 (298)	total: 4.6s	remaining: 15.4ms
299:	learn: 0.0989027	test: 0.0989027	best: 0.0989027 (299)	total: 4.62s	remaining: 0us
bestTest = 0.09890269159
bestIteration = 299
Out[11]:
<catboost.core.CatBoostRegressor at 0x24c9cb3af28>
In [12]:
# Mapping Feature Importance
plt.figure()
fea_imp = pd.DataFrame({'imp': model.feature_importances_, 'col': train_x.columns})
fea_imp['imp'] = round(fea_imp.imp, 2)
fea_imp = fea_imp.sort_values(['imp', 'col'], ascending=[True, False]).iloc[-30:]
_ = fea_imp.plot(kind='barh', x='col', y='imp', figsize=(4, 4))
plt.title('Var Imp in Predicting AGE from CatBoost')
plt.show()
fea_imp1 = fea_imp.sort_values('imp', ascending=False)
# add a row total (summing the 'col' column just concatenates the variable names)
fea_imp1['cum_sum_imp']= round(fea_imp1['imp'].cumsum(),2)
fea_imp1.loc['row_total'] = fea_imp.apply(lambda x: x.sum())
fea_imp1
<Figure size 432x288 with 0 Axes>
Out[12]:
            col                             imp     cum_sum_imp
0           height                          52.51   52.51
3           amt_book                        27.43   79.94
1           weight                          14.05   93.99
2           amt_strbk                        6.01   100.00
row_total   amt_strbkweightamt_bookheight   100.00  NaN
In [13]:
# now draw a scatter plot around the two key variables
# fix the X and Y axes by a rule (e.g., the most important variable always on X, the second on Y)
median_age = dff01.age.median()
colors1 = ['red' if x>median_age else 'blue' for x in dff01.age]
plt.scatter(rjitt(dff01.height), rjitt(dff01.amt_book), 
            s=5, alpha=0.2,
           color=colors1)
plt.xlabel('HEIGHT')
plt.ylabel('AMT BOOK')
plt.title('blue: young ; orange line:20%')
plt.axvline(np.percentile(dff01.height,80), 
            color='orange',linestyle=':')
plt.axhline(np.percentile(dff01.amt_book,80), 
            color='orange',linestyle=':')
plt.show()

[3] Work with three variables as one set (and no more than three variables; and no 3D plots)


- because nothing really stands out to the eye in a 3D plot~

In [16]:
# this time show the target (age) on a continuous scale (red ~ blue)
# scale age
rage = mnmx_scl(dff01.age)
colors1 = [(x, 0, 1-x) for x in rage]
plt.figure(figsize=(20, 3))
plt.subplot(141)
# fix the X and Y axes by a rule (e.g., the most important variable always on X, the second on Y)
# first panel
plt.scatter(rjitt(dff01.height), rjitt(dff01.amt_book), 
            s=5, alpha=0.1,
           color=colors1)
plt.xlabel('HEIGHT')
plt.ylabel('AMT BOOK')
plt.title('blue: young ; orange line:20%')
plt.axvline(np.percentile(dff01.height,80), 
            color='orange',linestyle=':')
plt.axhline(np.percentile(dff01.amt_book,80), 
            color='orange',linestyle=':')
# second panel
plt.subplot(142)
plt.scatter(rjitt(dff01.weight), rjitt(dff01.amt_book), 
            s=5, alpha=0.3,
           color=colors1)
plt.xlabel('WEIGHT')
plt.ylabel('AMT BOOK')
plt.title('blue: young ; orange line:20%')
plt.axvline(np.percentile(dff01.weight,80), 
            color='orange',linestyle=':')
plt.axhline(np.percentile(dff01.amt_book,80), 
            color='orange',linestyle=':')
# last panel
plt.subplot(143)
plt.scatter(rjitt(dff01.height), rjitt(dff01.weight), 
            s=5, alpha=0.3,
           color=colors1)
plt.xlabel('HEIGHT')
plt.ylabel('WEIGHT')
plt.title('blue: young ; orange line:20%')
plt.axvline(np.percentile(dff01.height,80), 
            color='orange',linestyle=':')
plt.axhline(np.percentile(dff01.weight,80), 
            color='orange',linestyle=':')
plt.show()

Drawing the 3D plot itself is not hard, but no pattern stands out to the eye in the result
--> the hope that rotating it will make things visible is usually disappointed

In [15]:
# 3d scatter plotting 
from mpl_toolkits import mplot3d
fig = plt.figure(figsize=(8,5))
ax = plt.axes(projection="3d")
ax.scatter3D(rjitt(dff01.height), rjitt(dff01.amt_book), rjitt(dff01.weight), 
             alpha=0.2, s=5, color=colors1)
plt.xlabel('HEIGHT')
plt.ylabel('AMT BOOK')
plt.title('blue: young')
plt.show()

Recap - Entirely subjective practical tips for EDA


[전용준. 리비젼컨설팅. Machine Learning]

- [1] Use one familiar library intensively (esp. plt.scatter)
- [2] If there are many variables, focus on the key variables first (esp. when there is a Target)
- [3] Work with three variables as one set (and no more than three variables; and no 3D plots)
Plus
- Whether a plot looks pretty does not matter in practice
- Early EDA is for analysis, not for reporting
- The goal is to find stark patterns quickly


- End -