RE::VISION CRM


Python: Practical Tips for EDA [전용준. 리비젼컨설팅]

YONG_X 2019. 8. 27. 14:50


A few (entirely subjective...) tips for doing exploratory data analysis in practice

The key points ....

- [1] Use one familiar library intensively (esp. plt.scatter)

- [2] If there are many variables, focus on the key variables first (esp. when there is a Target)

- [3] Work with three variables as one set (and no more than three variables; and no 3D plots)


__Plus__

- Whether a plot looks pretty does not matter in practice

- Early EDA is for analysis, not for reporting

- The goal is to find stark patterns quickly



[ #머신러닝 #EDA #파이썬 #전용준 #리비젼 #리비젼컨설팅 

#탐색적데이터분석 #데이터분석 #python #visualization

#scatter #decisiontree #catboost #variableimportance #변수중요도 ]








[ YouTube video ]




How I Do EDA :: Retail Customer Analysis -- BuyIt.com

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from numpy.polynomial.polynomial import polyfit
import matplotlib.style as style 
from IPython.display import Image
import warnings
warnings.filterwarnings('ignore')

# local data path (not used below; the practice data is loaded from a URL)
dataPath = 'C:/YONG/m1710/myPydata/'

# random jitter: Gaussian noise of ~1% of the variable's range,
# so that heavily overlapping points separate visually
def rjitt(arr):
    stdev = .01*(max(arr)-min(arr))
    return arr + np.random.randn(len(arr)) * stdev

# stronger jitter (~3.1% of the range)
def rjitt2(arr):
    stdev = .031*(max(arr)-min(arr))
    return arr + np.random.randn(len(arr)) * stdev

# min-max scaler to [0, 1]
def mnmx_scl(vec):
    vec = (vec-vec.min())/(vec.max()-vec.min())
    return(vec)

# min-max scaler after capping values at the 95th percentile
def mnmx_scl2(vec):
    vec = np.where(np.percentile(vec, 95) < vec,
                   np.percentile(vec, 95), vec)
    vec = np.array(vec)
    vec = (vec-vec.min())/(vec.max()-vec.min())
    return(vec)

A Practitioner's Tips for EDA - How I Do EDA


[전용준. 리비젼컨설팅. Machine Learning]

- [1] Use one familiar library intensively (esp. plt.scatter)
- [2] If there are many variables, focus on the key variables first (esp. when there is a Target)
- [3] Work with three variables as one set (and no more than three variables; and no 3D plots)

In [2]:
# load the practice data (CSV format) from the blog
dff01 = pd.read_csv('https://t1.daumcdn.net/cfile/blog/992CFF3B5D5CC70C2C?download')
print('Shape of the dataset : ', dff01.shape)
dff01.head()
Shape of the dataset :  (5000, 5)
Out[2]:
[ dff01.head() : first 5 rows -- columns: age, height, weight, amt_strbk, amt_book ]

[1] Use one familiar library intensively (esp. plt.scatter() )

In [3]:
# using plt.scatter from matplotlib
plt.scatter(dff01.amt_strbk, dff01.amt_book)
plt.xlabel('AMT STARBUCKS')
plt.ylabel('AMT BOOK')
plt.show()
In [4]:
# use the alpha (transparency) option
plt.scatter(dff01.amt_strbk, dff01.amt_book, s=5, alpha=0.2)
plt.xlabel('AMT STARBUCKS')
plt.ylabel('AMT BOOK')
plt.show()
In [5]:
# plt.scatter again, now with random jitter applied
plt.scatter(rjitt(dff01.amt_strbk), rjitt(dff01.amt_book), s=5, alpha=0.2)
plt.xlabel('AMT STARBUCKS')
plt.ylabel('AMT BOOK')
plt.show()
In [6]:
# if there is a target, show it (here: color by age group)
median_age = dff01.age.median()
colors1 = ['red' if x>median_age else 'blue' for x in dff01.age]
plt.scatter(rjitt(dff01.amt_strbk), rjitt(dff01.amt_book), 
            s=5, alpha=0.2,
           color=colors1)
plt.xlabel('AMT STARBUCKS')
plt.ylabel('AMT BOOK')
plt.title('blue: young')
plt.show()

- matplotlib, seaborn, and pandas are all fine. The point is that there is no need to dither between them, wasting time on Google searches.
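
For instance, the plot from In [4] is a one-liner in any of the three (a minimal sketch assuming dff01 from In [2]; seaborn is imported here only for the comparison):

import seaborn as sns

# pandas: scatter straight off the DataFrame
dff01.plot.scatter(x='amt_strbk', y='amt_book', s=5, alpha=0.2)
plt.show()

# seaborn: the same scatter in one call
sns.scatterplot(data=dff01, x='amt_strbk', y='amt_book', s=5, alpha=0.2)
plt.show()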

In [7]:
# reference lines contribute a lot to quick visual judgment
median_age = dff01.age.median()
colors1 = ['red' if x>median_age else 'blue' for x in dff01.age]
plt.scatter(rjitt(dff01.amt_strbk), rjitt(dff01.amt_book), 
            s=5, alpha=0.2,
           color=colors1)
plt.xlabel('AMT STARBUCKS')
plt.ylabel('AMT BOOK')
plt.title('blue: young ; orange line:20%')
plt.axvline(np.percentile(dff01.amt_strbk,80), 
            color='orange',linestyle=':')
plt.axhline(np.percentile(dff01.amt_book,80), 
            color='orange',linestyle=':')
plt.show()

Findings :


- Many customers have no book purchases at all
- Most customers have at least some Starbucks spend
- On the younger side, both Starbucks and book spend are often high
- Among customers who bought both items, the two seem positively correlated
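
These eyeballed findings can be backed with a few quick numbers (a minimal sketch, assuming dff01 as loaded above):

# share of customers with no book purchases at all
print('no book purchases :', (dff01.amt_book == 0).mean())
# share of customers with at least some Starbucks spend
print('some starbucks    :', (dff01.amt_strbk > 0).mean())
# correlation between the two amounts among customers who bought both
both = dff01[(dff01.amt_book > 0) & (dff01.amt_strbk > 0)]
print('corr where both>0 :', round(both.amt_strbk.corr(both.amt_book), 3))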

[2] If there are many variables, focus on the key variables first (esp. when there is a Target)


- If there are many variables, get variable importance from a DT, RF, or CatBoost (it is not a __model__ yet; just an auxiliary tool for exploring the variables)
- Using RF means the hassle of encoding categorical variables, so __CatBoost__ is more convenient (see the sketch below)
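
As a sketch of that convenience: CatBoost takes raw categorical columns directly through cat_features, with no encoding step. (Hypothetical example -- the practice dataset above has no categorical column, so 'segment' below is made up purely for illustration.)

from catboost import CatBoostRegressor

df_cat = dff01.copy()
# made-up categorical column, for illustration only
df_cat['segment'] = np.random.choice(['mobile', 'web', 'store'], size=len(df_cat))

features = ['height', 'weight', 'amt_strbk', 'amt_book', 'segment']
cb = CatBoostRegressor(iterations=100, depth=3, learning_rate=0.05,
                       random_seed=42, verbose=False)
# categorical columns are passed by name (or index) as-is
cb.fit(df_cat[features], df_cat['age'], cat_features=['segment'])
print(dict(zip(features, cb.feature_importances_)))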

In [8]:
# searching for key variables using DT
from sklearn.tree import DecisionTreeRegressor
dt1 = DecisionTreeRegressor(min_samples_split=50, max_depth=3, min_samples_leaf=10, random_state=99)
dtfeatures = ['height','weight','amt_strbk','amt_book']
tgt = 'age'
dfdt = dff01
dt1.fit(dfdt[dtfeatures], dfdt[tgt])
# the GraphViz executable path must be set beforehand,
# or [ InvocationException: GraphViz's executables not found ] occurs
import os     
os.environ["PATH"] += os.pathsep + 'C:\\Program Files\\Anaconda3\\Library\\bin\\graphviz'
from sklearn import tree
from IPython.display import Image  
import pydotplus
import graphviz
print('variable importance : ', dict(zip(dfdt[dtfeatures].columns, dt1.feature_importances_)))
# Create DOT data
dot_data = tree.export_graphviz(dt1, out_file=None, 
                                feature_names=dtfeatures)
# Draw graph
graph = pydotplus.graph_from_dot_data(dot_data)  
# Show graph
Image(graph.create_png())
variable importance :  {'amt_strbk': 0.07748767837764613, 'amt_book': 0.19937549317012487, 'weight': 0.0, 'height': 0.7231368284522289}
Out[8]:
[ decision tree diagram ]
In [9]:
# searching for key variables
# a deeper, bushier tree can show more info on variable importance
from sklearn.tree import DecisionTreeRegressor
dt1 = DecisionTreeRegressor(min_samples_split=30, max_depth=5, 
                            min_samples_leaf=5, random_state=99)
dtfeatures = ['height','weight','amt_strbk','amt_book']
tgt = 'age'
dfdt = dff01
dt1.fit(dfdt[dtfeatures], dfdt[tgt])
# the GraphViz executable path must be set beforehand,
# or [ InvocationException: GraphViz's executables not found ] occurs
import os     
os.environ["PATH"] += os.pathsep + 'C:\\Program Files\\Anaconda3\\Library\\bin\\graphviz'
from sklearn import tree
from IPython.display import Image  
import pydotplus
import graphviz
print('variable importance : ', dict(zip(dfdt[dtfeatures].columns, dt1.feature_importances_)))
# Create DOT data
dot_data = tree.export_graphviz(dt1, out_file=None, 
                                feature_names=dtfeatures)
# Draw graph
graph = pydotplus.graph_from_dot_data(dot_data)  
# Show graph
Image(graph.create_png())
variable importance :  {'amt_strbk': 0.10799550290753211, 'amt_book': 0.22525717355765157, 'weight': 0.036030155469811656, 'height': 0.6307171680650047}
Out[9]:
[ decision tree diagram ]
In [10]:
d = dict(zip(dfdt[dtfeatures].columns, dt1.feature_importances_))
d.items()
ddf = pd.DataFrame(pd.Series(d, name='Importance')).reset_index().sort_values('Importance')
ddf.columns = ['VarName', 'Importance']
ddf
plt.barh(ddf['VarName'], ddf['Importance'], color='navy')
plt.title('Features Predicting AGE')
plt.xlabel('Relative Feature Importance from DT')
plt.show()

- Because __CatBoost__ builds many trees, it gives more stable (= more trustworthy) results (see the quick check below)
- CatBoost also gives variables other than the one or two key ones a reasonably fair share of importance.
(This helps keep the influence of the key variables from being overrated. It does, however, take a bit more resources to run.)
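
How much a single tree's importances wobble is easy to check: refit the same DecisionTreeRegressor on a few bootstrap resamples and watch the numbers move. A minimal sketch, assuming dfdt, dtfeatures and tgt from the cells above (averaging over many such trees is what RF/CatBoost effectively does):

from sklearn.tree import DecisionTreeRegressor

# the same single tree, refit on bootstrap resamples of the data
for seed in range(3):
    boot = dfdt.sample(frac=1.0, replace=True, random_state=seed)
    dt = DecisionTreeRegressor(min_samples_split=30, max_depth=5,
                               min_samples_leaf=5, random_state=99)
    dt.fit(boot[dtfeatures], boot[tgt])
    print(seed, dict(zip(dtfeatures, np.round(dt.feature_importances_, 3))))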

In [11]:
from catboost import CatBoostRegressor
model = CatBoostRegressor(iterations=300, depth=3, learning_rate=0.01,
                           eval_metric='R2', 
                           use_best_model=True,
                           random_seed=42)
train_x = dfdt[dtfeatures] 
tgt = 'age'
train_y = dfdt[tgt]
model.fit(
    train_x, train_y,
    # cat_features=categorical_features_indices,
    # verbose=True,  # you can uncomment this for text output
    # plot=True,  # interactive training plot (not used here)
    eval_set=(train_x, train_y)    
)
0:	learn: -0.0000008	test: -0.0000008	best: -0.0000008 (0)	total: 86.7ms	remaining: 25.9s
1:	learn: 0.0000145	test: 0.0000145	best: 0.0000145 (1)	total: 104ms	remaining: 15.6s
2:	learn: 0.0000157	test: 0.0000157	best: 0.0000157 (2)	total: 124ms	remaining: 12.3s
:
: (training log truncated)
:
298:	learn: 0.0982952	test: 0.0982952	best: 0.0982952 (298)	total: 4.6s	remaining: 15.4ms
299:	learn: 0.0989027	test: 0.0989027	best: 0.0989027 (299)	total: 4.62s	remaining: 0us
bestTest = 0.09890269159
bestIteration = 299
Out[11]:
<catboost.core.CatBoostRegressor at 0x24c9cb3af28>
In [12]:
# Mapping Feature Importance
plt.figure()
fea_imp = pd.DataFrame({'imp': model.feature_importances_, 'col': train_x.columns})
fea_imp['imp'] = round(fea_imp.imp, 2)
fea_imp = fea_imp.sort_values(['imp', 'col'], ascending=[True, False]).iloc[-30:]
_ = fea_imp.plot(kind='barh', x='col', y='imp', figsize=(4, 4))
plt.title('Var Imp in Predicting AGE from CatBoost')
plt.show()
fea_imp1 = fea_imp.sort_values('imp', ascending=False)
# add a row total (summing the 'col' column just concatenates the variable names)
fea_imp1['cum_sum_imp']= round(fea_imp1['imp'].cumsum(),2)
fea_imp1.loc['row_total'] = fea_imp.apply(lambda x: x.sum())
fea_imp1
<Figure size 432x288 with 0 Axes>
Out[12]:
            col                             imp     cum_sum_imp
0           height                          52.51   52.51
3           amt_book                        27.43   79.94
1           weight                          14.05   93.99
2           amt_strbk                        6.01   100.00
row_total   amt_strbkweightamt_bookheight   100.00  NaN
In [13]:
# now draw a scatter plot around the two key variables
# fix the X and Y axes by a rule (e.g., the most important variable always on X, the second on Y)
median_age = dff01.age.median()
colors1 = ['red' if x>median_age else 'blue' for x in dff01.age]
plt.scatter(rjitt(dff01.height), rjitt(dff01.amt_book), 
            s=5, alpha=0.2,
           color=colors1)
plt.xlabel('HEIGHT')
plt.ylabel('AMT BOOK')
plt.title('blue: young ; orange line:20%')
plt.axvline(np.percentile(dff01.height,80), 
            color='orange',linestyle=':')
plt.axhline(np.percentile(dff01.amt_book,80), 
            color='orange',linestyle=':')
plt.show()

[3] Work with three variables as one set (and no more than three variables; and no 3D plots)


- because nothing really stands out to the eye in a 3D plot~

In [16]:
# this time show the target (age) on a continuous scale (red ~ blue)
# scale age
rage = mnmx_scl(dff01.age)
colors1 = [(x, 0, 1-x) for x in rage]
plt.figure(figsize=(20, 3))
plt.subplot(141)
# fix the X and Y axes by a rule (e.g., the most important variable always on X, the second on Y)
# first panel
plt.scatter(rjitt(dff01.height), rjitt(dff01.amt_book), 
            s=5, alpha=0.1,
           color=colors1)
plt.xlabel('HEIGHT')
plt.ylabel('AMT BOOK')
plt.title('blue: young ; orange line:20%')
plt.axvline(np.percentile(dff01.height,80), 
            color='orange',linestyle=':')
plt.axhline(np.percentile(dff01.amt_book,80), 
            color='orange',linestyle=':')
# second panel
plt.subplot(142)
plt.scatter(rjitt(dff01.weight), rjitt(dff01.amt_book), 
            s=5, alpha=0.3,
           color=colors1)
plt.xlabel('WEIGHT')
plt.ylabel('AMT BOOK')
plt.title('blue: young ; orange line:20%')
plt.axvline(np.percentile(dff01.weight,80), 
            color='orange',linestyle=':')
plt.axhline(np.percentile(dff01.amt_book,80), 
            color='orange',linestyle=':')
# last panel
plt.subplot(143)
plt.scatter(rjitt(dff01.height), rjitt(dff01.weight), 
            s=5, alpha=0.3,
           color=colors1)
plt.xlabel('HEIGHT')
plt.ylabel('WEIGHT')
plt.title('blue: young ; orange line:20%')
plt.axvline(np.percentile(dff01.height,80), 
            color='orange',linestyle=':')
plt.axhline(np.percentile(dff01.weight,80), 
            color='orange',linestyle=':')
plt.show()

Drawing the 3D plot itself is not hard, but no pattern stands out to the eye in the result
--> the hope that rotating it will make things visible is usually disappointed

In [15]:
# 3d scatter plotting 
from mpl_toolkits import mplot3d
fig = plt.figure(figsize=(8,5))
ax = plt.axes(projection="3d")
ax.scatter3D(rjitt(dff01.height), rjitt(dff01.amt_book), rjitt(dff01.weight), 
             alpha=0.2, s=5, color=colors1)
plt.xlabel('HEIGHT')
plt.ylabel('AMT BOOK')
plt.title('blue: young')
plt.show()

Recap - Entirely subjective practical tips for EDA


[전용준. 리비젼컨설팅. Machine Learning]

- [1] Use one familiar library intensively (esp. plt.scatter)
- [2] If there are many variables, focus on the key variables first (esp. when there is a Target)
- [3] Work with three variables as one set (and no more than three variables; and no 3D plots)
Plus
- Whether a plot looks pretty does not matter in practice
- Early EDA is for analysis, not for reporting
- The goal is to find stark patterns quickly


- End -