>RE::VISION CRM

Python Data Analysis

[Data Analysis] Davis EDA

YONG_X 2018. 10. 15. 15:39


carData_Davis.csv


Sample EDA Visual Analysis using Pandas and matplotlib scatter plot

Exploratory Visual Data Analysis using Pandas and matplotlib

[1] Import the libraries and [2] load the sample data set for analysis from a link on the blog (the data comes from the "car" package in R)

- the data set contains the measured and self-reported weight and height of students
In [5]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from numpy.polynomial.polynomial import polyfit
# Read data
d1 = pd.read_csv("https://t1.daumcdn.net/cfile/blog/99C08C4D5BC4362B22?download")
d1.head()
Out[5]:
   Unnamed: 0 sex  weight  height  repwt  repht
0           1   M      77     182   77.0  180.0
1           2   F      58     161   51.0  159.0
2           3   F      53     161   54.0  158.0
3           4   M      68     177   70.0  175.0
4           5   F      59     157   59.0  155.0

Create a vector of color codes representing sex (1 for "M", 0 otherwise)

In [20]:
# 1 for male ("M"), 0 for everything else
colors = [1 if i == "M" else 0 for i in d1['sex']]
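
One possible alternative to the list comprehension, sketched here on just the five rows shown in the head() output rather than on the full file, is Series.map with an explicit dictionary. The names sex and color_codes are only for illustration. With map, a value that is neither "M" nor "F" would become NaN instead of silently being coded as 0.

```python
import pandas as pd

# First five 'sex' values, copied from the d1.head() output above
sex = pd.Series(["M", "F", "F", "M", "F"], name="sex")

# Explicit mapping: a label missing from the dict turns into NaN,
# so unexpected codes are easy to spot
color_codes = sex.map({"M": 1, "F": 0})

print(color_codes.tolist())
print("unmapped values:", color_codes.isna().sum())
```

On the real data this would be d1['sex'].map(...), and the result can be passed to c= in plt.scatter in the same way as the list.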

Create a scatter plot showing actual height and weight

In [21]:
plt.scatter(d1['weight'], d1['height'], alpha=0.5,
           c=colors)
# plt.show()            
# add title and labels
plt.title("Weight and Height of Students")
plt.xlabel("Weight")
plt.ylabel("Height")
Out[21]:
Text(0,0.5,'Height')

Add a regression line to show the overall trend. One abnormal data point stands out (height around 60 and weight around 160) ==> is she a normal human?

In [22]:
# add regression line 
b, m = polyfit(d1['weight'], d1['height'], 1)
plt.plot(d1['weight'], b + m * d1['weight'], '-')
plt.grid()
plt.show()
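
A small check of why the result is unpacked as b, m (intercept first, then slope). The points below are synthetic, made up so that they lie exactly on y = 2 + 3x; they are not from the Davis data.

```python
import numpy as np
from numpy.polynomial.polynomial import polyfit

# Synthetic points on the line y = 2 + 3x (illustration only)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# numpy.polynomial.polynomial.polyfit returns coefficients from the
# lowest degree up: intercept first, then slope
b, m = polyfit(x, y, 1)
print("intercept:", round(float(b), 6), "slope:", round(float(m), 6))

# The older np.polyfit returns the opposite order (highest degree first)
m_old, b_old = np.polyfit(x, y, 1)
print("np.polyfit order:", round(float(m_old), 6), round(float(b_old), 6))
```

Mixing up the two functions swaps slope and intercept without raising an error, so it is worth checking which polyfit was imported.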

Retrieve the student whose height is under 100

In [27]:
d1.loc[d1['height'] < 100]
Out[27]:
    Unnamed: 0 sex  weight  height  repwt  repht
11          12   F     166      57   56.0  163.0
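
The height < 100 filter works because the plot already showed where to look. A more general check, sketched below, compares the measured values with the reported ones and flags rows where they disagree strongly. It uses only the six rows printed in this post, and the tolerance of 20 is an arbitrary assumption, not a recommended value.

```python
import pandas as pd

# Rows copied from the outputs above: the first five rows plus index 11
rows = pd.DataFrame(
    {
        "sex": ["M", "F", "F", "M", "F", "F"],
        "weight": [77, 58, 53, 68, 59, 166],
        "height": [182, 161, 161, 177, 157, 57],
        "repwt": [77.0, 51.0, 54.0, 70.0, 59.0, 56.0],
        "repht": [180.0, 159.0, 158.0, 175.0, 155.0, 163.0],
    },
    index=[0, 1, 2, 3, 4, 11],
)

# Assumed tolerance: flag a gap above 20 units between measured and reported
TOLERANCE = 20
gap_wt = (rows["weight"] - rows["repwt"]).abs()
gap_ht = (rows["height"] - rows["repht"]).abs()

# Rows with a missing reported value compare as False and are not flagged
suspicious = rows[(gap_wt > TOLERANCE) | (gap_ht > TOLERANCE)]
print(suspicious)
```

On these six rows only index 11 is flagged, the same student found with the height filter.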

Comparing with the reported weight (56) and reported height (163), we find that the measured weight and height were swapped at data entry, so a correction should be made

In [40]:
# swap the mis-entered values back: weight 166 -> 57, height 57 -> 166
d1.loc[d1['height'] < 100, ['weight', 'height']] = [57, 166]
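
The same correction can also be written without typing the numbers by hand, by swapping the two columns for the rows selected by the mask. This is only a sketch on a one-row frame rebuilt from the output above; row and mask are illustrative names.

```python
import pandas as pd

# The suspicious row as printed above (weight and height look swapped)
row = pd.DataFrame(
    {"sex": ["F"], "weight": [166], "height": [57],
     "repwt": [56.0], "repht": [163.0]},
    index=[11],
)

mask = row["height"] < 100

# .to_numpy() drops the column labels, so the values are assigned by
# position: height goes into weight and weight goes into height
row.loc[mask, ["weight", "height"]] = row.loc[mask, ["height", "weight"]].to_numpy()

print(row)
```

Without .to_numpy() pandas would align on the column names and nothing would be swapped.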

Create a scatter plot with the corrected data set. Now it looks okay! :-)

In [41]:
plt.scatter(d1['weight'], d1['height'], alpha=0.5,
           c=colors)
# plt.show()            
# add title and labels
plt.title("Weight and Height of Students")
plt.xlabel("Weight")
plt.ylabel("Height")
b, m = polyfit(d1['weight'], d1['height'], 1)
plt.plot(d1['weight'], b + m * d1['weight'], '-')
plt.grid()
plt.show()
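
To put a number on what a single bad row can do to the trend, here is a rough sketch comparing the weight-height correlation before and after the swap. It uses only the six rows printed in this post, so the values will differ from those on the full data set; before and after are illustrative names.

```python
import pandas as pd

# Six rows from the outputs above; the last one is the mis-entered student
before = pd.DataFrame(
    {
        "weight": [77, 58, 53, 68, 59, 166],
        "height": [182, 161, 161, 177, 157, 57],
    }
)

# Same rows with weight and height of the last student swapped back
after = before.copy()
after.loc[5, ["weight", "height"]] = [57, 166]

r_before = before["weight"].corr(before["height"])
r_after = after["weight"].corr(after["height"])
print(f"r before: {r_before:.2f}, r after: {r_after:.2f}")
```

On this small sample the sign of the correlation flips from negative to positive once the row is corrected.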

The goal of EDA is not just to draw charts. Understanding the distribution of the data and discovering errors are also important parts of the EDA phase.


