Sample EDA Visual Analysis using Pandas and matplotlib scatter plot

Exploratory Visual Data Analysis using Pandas and matplotlib

[1] Import libraries and [2] import a sample data set for analysis from a blog website link (the data comes from the "car" package in R)

- The data set contains self-reported and measured weight and height of students

In [5]:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from numpy.polynomial.polynomial import polyfit
# Read data
d1 = pd.read_csv("https://t1.daumcdn.net/cfile/blog/99C08C4D5BC4362B22?download")
d1.head()

Out[5]:

	Unnamed: 0	sex	weight	height	repwt	repht
0	1	M	77	182	77.0	180.0
1	2	F	58	161	51.0	159.0
2	3	F	53	161	54.0	158.0
3	4	M	68	177	70.0	175.0
4	5	F	59	157	59.0	155.0

Create a vector of color codes representing sex

In [20]:

# 1 for male ("M"), 0 for everything else
colors = [1 if sex == "M" else 0 for sex in d1['sex']]
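The same encoding can also be written with a lookup table, which may be easier to extend. The sketch below is only an illustration: the five values are copied from the d1.head() output above, and the variable names `sex` and `color_codes` are made up for this example.

```python
import pandas as pd

# First five 'sex' values, copied from the d1.head() output
sex = pd.Series(["M", "F", "F", "M", "F"], name="sex")

# Lookup table: "M" -> 1, "F" -> 0 (same codes as the list comprehension)
color_codes = sex.map({"M": 1, "F": 0})

print(color_codes.tolist())
```

One difference from the list comprehension: a value that is missing from the lookup table becomes NaN instead of silently turning into 0, so unexpected category labels show up early.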

Create a scatter plot showing actual height and weight

In [21]:

plt.scatter(d1['weight'], d1['height'], alpha=0.5, c=colors)
# add title and labels
plt.title("Weight and Height of Students")
plt.xlabel("Weight")
plt.ylabel("Height")

Out[21]:

Text(0,0.5,'Height')

Add a regression line to show the overall trend. The plot reveals an abnormal data point (height around 60 and weight around 160). Is that a plausible measurement for a person?

In [22]:

# add regression line 
b, m = polyfit(d1['weight'], d1['height'], 1)
plt.plot(d1['weight'], b + m * d1['weight'], '-')
plt.grid()
plt.show()
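The unpacking order `b, m` matters here: `numpy.polynomial.polynomial.polyfit` returns coefficients from the lowest degree up, so the intercept comes first. A minimal sketch with made-up points on a known line (the numbers are synthetic, not taken from the data set) shows the order and contrasts it with the older `np.polyfit`:

```python
import numpy as np
from numpy.polynomial.polynomial import polyfit

# Synthetic points lying exactly on y = 100 + 0.5 * x (made-up values)
x = np.array([50.0, 60.0, 70.0, 80.0])
y = 100.0 + 0.5 * x

# New-style polyfit: coefficients ordered from lowest degree up,
# so the intercept b comes before the slope m
b, m = polyfit(x, y, 1)

# np.polyfit uses the opposite order: highest degree first
m_old, b_old = np.polyfit(x, y, 1)

print(round(b, 6), round(m, 6))
```

Mixing up the two conventions would swap slope and intercept without raising any error, so it is worth checking on a line with known coefficients.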

Retrieve the student whose height is under 100

In [27]:

d1.loc[d1['height'] < 100]

Out[27]:

	Unnamed: 0	sex	weight	height	repwt	repht
11	12	F	166	57	56.0	163.0

Comparing against the reported weight (repwt) and reported height (repht) shows a data entry mistake: the measured weight and height were swapped in this row, so it should be corrected.
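One way to check that this is a data entry mistake is to test whether the measured values disagree with the reported ones as entered, but agree once weight and height are exchanged. The sketch below uses three rows copied from the outputs above; the tolerance `tol` and the variable names are hypothetical choices for illustration.

```python
import pandas as pd

# Three rows copied from the outputs above (index 0, 1 and 11)
rows = pd.DataFrame(
    {
        "sex": ["M", "F", "F"],
        "weight": [77, 58, 166],
        "height": [182, 161, 57],
        "repwt": [77.0, 51.0, 56.0],
        "repht": [180.0, 159.0, 163.0],
    },
    index=[0, 1, 11],
)

tol = 10  # hypothetical tolerance, in the data's own units

# Measured values are far from the reported ones as entered...
weight_off = (rows["weight"] - rows["repwt"]).abs() > tol
height_off = (rows["height"] - rows["repht"]).abs() > tol

# ...but close to them when weight and height are exchanged
swap_fits = ((rows["height"] - rows["repwt"]).abs() <= tol) & (
    (rows["weight"] - rows["repht"]).abs() <= tol
)

suspects = rows[weight_off & height_off & swap_fits]
print(suspects.index.tolist())
```

On these three rows only index 11 is flagged, which matches the row found by the height filter.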

In [40]:

# Swap the mistyped values back: weight 57, height 166
d1.loc[d1['height'] < 100, ['weight', 'height']] = [57, 166]

Create the scatter plot again with the corrected data set. Now it looks okay! :-)

In [41]:

plt.scatter(d1['weight'], d1['height'], alpha=0.5, c=colors)
# add title and labels
plt.title("Weight and Height of Students")
plt.xlabel("Weight")
plt.ylabel("Height")
b, m = polyfit(d1['weight'], d1['height'], 1)
plt.plot(d1['weight'], b + m * d1['weight'], '-')
plt.grid()
plt.show()

The goal of EDA is not just to draw a chart. Understanding the data distribution and discovering errors are also important in the EDA phase.
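A numeric summary can complement the chart when looking for errors. As a small sketch, the heights below are the five values from the d1.head() output plus the uncorrected value from row 11; the variable names are made up for this example.

```python
import pandas as pd

# Heights from the five head rows plus the uncorrected row 11
heights = pd.Series([182, 161, 161, 177, 157, 57], name="height")

# describe() reports count, mean, std, min, quartiles and max
summary = heights.describe()

print(summary[["min", "50%", "max"]])
```

A minimum far below the median, as in this sample, is the kind of signal that points to the same row the scatter plot exposed.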

Attached data file: carData_Davis.csv

License: Attribution, NonCommercial, NoDerivatives