Create a scatter plot showing actual height and weight
In [21]:
plt.scatter(d1['weight'], d1['height'], alpha=0.5, c=colors)
# plt.show()
# add title and labels
plt.title("Weight and Height of Students")
plt.xlabel("Weight")
plt.ylabel("Height")
Out[21]:
Text(0,0.5,'Height')
Add a regression line to show the overall trend. One data point is abnormal (height around 60, weight around 160): is she a normal human?
In [22]:
# add regression line
b, m = polyfit(d1['weight'], d1['height'], 1)
plt.plot(d1['weight'], b + m * d1['weight'], '-')
plt.grid()
plt.show()
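The unpacking `b, m = polyfit(...)` only gives intercept then slope if `polyfit` is the one from `numpy.polynomial.polynomial`, which is an assumption here since the import is not shown in this section. The sketch below, on made-up data that lies exactly on a known line, contrasts it with `np.polyfit`, which returns the coefficients in the opposite order.

```python
import numpy as np
from numpy.polynomial.polynomial import polyfit

# Hypothetical data lying exactly on the line height = 130 + 0.5 * weight
weight = np.array([50.0, 60.0, 70.0, 80.0])
height = 130.0 + 0.5 * weight

# numpy.polynomial.polynomial.polyfit: lowest degree first (intercept, slope)
b, m = polyfit(weight, height, 1)

# np.polyfit: highest degree first (slope, intercept)
m2, b2 = np.polyfit(weight, height, 1)

print(b, m)    # intercept and slope
print(m2, b2)  # slope and intercept
```

If the notebook actually imported `np.polyfit` under the name `polyfit`, the two names in the unpacking would have to be swapped.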
Retrieve the student whose height is under 100.
In [27]:
d1.loc[d1['height'] < 100]
Out[27]:
    Unnamed: 0 sex  weight  height  repwt  repht
11          12   F     166      57   56.0  163.0
Comparing these values with the reported weight (56) and reported height (163), we find a data entry mistake: the weight and height were swapped. The record should be corrected.
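The same comparison can be done programmatically instead of by eye. The following is a minimal sketch on a small hypothetical frame that reuses the column names of `d1`; the sample rows (apart from the record shown above) and the tolerance `tol` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample in the layout of d1; the second row mirrors the record found above
df = pd.DataFrame({
    "sex": ["M", "F", "F"],
    "weight": [77, 166, 59],
    "height": [182, 57, 157],
    "repwt": [77.0, 56.0, 59.0],
    "repht": [180.0, 163.0, 155.0],
})

tol = 10  # assumed tolerance between measured and reported values

# A row looks swapped when the measured weight is close to the reported height
# and the measured height is close to the reported weight
swapped = (
    ((df["weight"] - df["repht"]).abs() <= tol)
    & ((df["height"] - df["repwt"]).abs() <= tol)
)
suspects = df[swapped]
print(suspects)
```

Rows with missing reported values compare as False here, so they are never flagged; they would need a separate check.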
In [40]:
# overwrite the whole row in column order: Unnamed: 0, sex, weight, height, repwt, repht
d1.loc[d1['height'] < 100] = 12, "F", 57, 166, 56, 163
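Overwriting the whole row works, but it depends on typing every value in the right column order. A possible alternative, sketched here on a hypothetical frame with the same column names, swaps only the two affected columns for the matching rows. The `.to_numpy()` call is needed so pandas assigns by position instead of re-aligning on column labels.

```python
import pandas as pd

# Hypothetical sample in the layout of d1; the second row mirrors the bad record
df = pd.DataFrame({
    "sex": ["M", "F", "F"],
    "weight": [77, 166, 59],
    "height": [182, 57, 157],
    "repwt": [77.0, 56.0, 59.0],
    "repht": [180.0, 163.0, 155.0],
})

mask = df["height"] < 100

# Swap weight and height only for the rows selected by the mask
df.loc[mask, ["weight", "height"]] = df.loc[mask, ["height", "weight"]].to_numpy()
print(df)
```

The other columns and rows are left untouched, so the reported values keep their original dtype.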
Create a scatter plot with the corrected data set. Now it looks okay! :-)
In [41]:
plt.scatter(d1['weight'], d1['height'], alpha=0.5, c=colors)
# plt.show()
# add title and labels
plt.title("Weight and Height of Students")
plt.xlabel("Weight")
plt.ylabel("Height")
b, m = polyfit(d1['weight'], d1['height'], 1)
plt.plot(d1['weight'], b + m * d1['weight'], '-')
plt.grid()
plt.show()
The goal of EDA is not just drawing a chart. Understanding the data distribution and discovering errors are also important parts of the EDA phase.