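Example 1: Given $x = (6, 3, 5, 8, 7)$ and $y = (2, 4, 3, 7, 6)$, compute the sample means $\bar{x}$, $\bar{y}$ and the sample standard deviations $s_x$, $s_y$, where
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}.$$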
Note:
import numpy as np
x = np.array([6, 3, 5, 8, 7])
y = np.array([2, 4, 3, 7, 6])
sum_x_formula = x[0] + x[1] + x[2] + x[3] + x[4]
sum_x = x.sum() # method of the NumPy array
n = len(x) # sample size
x_bar_formula = sum_x / n
x_bar = x.mean()
var_x_formula_1 = ((x - x_bar)**2).sum() / (n-1) # element by element operation
var_x_formula_2 = np.sum((x - x_bar)**2) / (n-1)
var_x = x.var( ddof = 1) # ddof = 1: divided by N-1
std_x_formula = np.sqrt(var_x)
std_x = x.std( ddof = 1)
y_bar = y.mean()
std_y = y.std( ddof = 1)
print('The sample mean of x is {:.4f}'.format(x_bar))
print('The sample standard deviation of x is %.6f' %std_x)
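As a quick sanity check, the formula-based results can be compared with NumPy's built-in methods. The sketch below reuses the same x and also shows what changes when `ddof` is left at its default of 0 (division by n instead of n-1). The names `ss`, `var_sample` and `var_population` are chosen here for illustration only.

```python
import numpy as np

x = np.array([6, 3, 5, 8, 7])
n = len(x)
x_bar = x.sum() / n                            # sample mean by formula

ss = ((x - x_bar) ** 2).sum()                  # sum of squared deviations
var_sample = ss / (n - 1)                      # divide by n-1
var_population = ss / n                        # divide by n

print(np.isclose(var_sample, x.var(ddof=1)))   # formula agrees with ddof = 1
print(np.isclose(var_population, x.var()))     # default is ddof = 0
print('sample std = {:.4f}'.format(np.sqrt(var_sample)))
```

For small samples the two variances differ noticeably, so it is worth stating `ddof` explicitly whenever a variance or standard deviation is reported.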
Exercise: Compute the z-scores of the sample $x$, where
$$z_i = \frac{x_i - \bar{x}}{s_x} \quad \text{for } i = 1, \dots, n.$$
Note:
import numpy as np
from scipy.stats import zscore
x = np.array([6, 3, 5, 8, 7])
z = zscore(x, ddof = 1)
z_formula = .... do it yourself
# To compare two vectors
print(np.c_[z, z_formula]) # place the two vectors side by side as columns
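One possible way to check an answer to this exercise is sketched below. It assumes the z-score is defined with the sample standard deviation, to match `ddof = 1` in the `zscore` call, and it uses only NumPy. The name `z_sketch` is hypothetical.

```python
import numpy as np

x = np.array([6, 3, 5, 8, 7])
z_sketch = (x - x.mean()) / x.std(ddof=1)    # (x_i - mean) / sample std

# standardized values have mean 0 and sample standard deviation 1
print(np.isclose(z_sketch.mean(), 0.0))
print(np.isclose(z_sketch.std(ddof=1), 1.0))

# np.c_ places 1-D arrays side by side as columns of a 2-D array
print(np.c_[x, z_sketch])
```

Checking that the standardized values have mean 0 and standard deviation 1 is a useful test even when no library function is available for comparison.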
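Example 2: Given $x$ and $y$ as in Example 1, compute the correlation coefficient $r$,
where
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}.$$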
Note:
import numpy as np
from scipy.stats import pearsonr
x = np.array([6, 3, 5, 8, 7])
y = np.array([2, 4, 3, 7, 6])
# coding for r by formula
..............
r_formula = .....
print('Correlation coefficient by formula is {:.4f}'.format(r_formula))
# directly use command from scipy.stats
r_sci = pearsonr(x, y)[0] # the first return value
print('Correlation coefficient by scipy.stats.pearsonr is %.4f' %r_sci)
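A minimal sketch of the formula-based computation, assuming the usual Pearson definition (sum of products of deviations divided by the square root of the product of the sums of squared deviations). It is cross-checked against `np.corrcoef` rather than `pearsonr` so that it needs only NumPy. The names `dx`, `dy`, `r_sketch` and `r_np` are hypothetical.

```python
import numpy as np

x = np.array([6, 3, 5, 8, 7])
y = np.array([2, 4, 3, 7, 6])

dx, dy = x - x.mean(), y - y.mean()      # deviations from the means
r_sketch = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_np = np.corrcoef(x, y)[0, 1]           # off-diagonal entry of the 2 x 2 matrix
print('r by formula  = {:.4f}'.format(r_sketch))
print('r by corrcoef = {:.4f}'.format(r_np))
```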
Example 2: Draw the scatter plot of the two samples $x$ and $y$, using the data from the previous examples.
Note:
import numpy as np
import matplotlib.pyplot as plt
x = np.array([6, 3, 5, 8, 7])
y = np.array([2, 4, 3, 7, 6])
size, color = 50, 'g'
plt.scatter(x, y, s = size, c = color)
# arrange axis range for better look
plt.axis([0, 10, 0,10])
plt.xlabel('x'), plt.ylabel('y')
plt.show()
Example 3: Load data from a text (.txt) file, draw the scatter plot of the samples, and compute the correlation coefficient r.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
data_dir = '../Data/'
D = np.loadtxt(data_dir + 'data1.txt', comments='%')
x, y = D[:, 0], D[:, 1]
r = pearsonr(x, y)[0]
plt.scatter(x, y, s = 50, c = 'b', alpha = 0.5)
plt.axis([x.min()-1, x.max()+1, y.min()-1, y.max()+1])
plt.title('r = %.4f' %r)
plt.xlabel('x'), plt.ylabel('y')
plt.show()
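The `comments='%'` argument tells `np.loadtxt` to skip everything from a `%` character to the end of the line. Since the contents of data1.txt are not shown here, the sketch below uses a made-up in-memory file (`text` is hypothetical) to illustrate the behaviour and the column slicing.

```python
import io
import numpy as np

# hypothetical file contents; the first line is a '%' comment
text = """% x   y
1.0  2.0
2.0  4.1
3.0  5.9
"""
D = np.loadtxt(io.StringIO(text), comments='%')
x, y = D[:, 0], D[:, 1]        # first and second columns
print(D.shape)
print(x, y)
```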
Exercise: Extend the previous example by fitting a first-degree polynomial (a straight line) to the data and drawing it on the scatter plot.
Note :
from numpy.polynomial import polynomial
.... previous codes here
coef = polynomial.polyfit(x, y, 1)
.... add your codes below
....
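A hedged sketch of how the fitted coefficients might be used. Because the contents of data1.txt are not shown, it reuses the five-point sample from Examples 1 and 2. Note that `polynomial.polyfit` returns coefficients in ascending order of degree, so the intercept comes first. The names `x_line` and `y_line` are hypothetical.

```python
import numpy as np
from numpy.polynomial import polynomial

x = np.array([6, 3, 5, 8, 7])
y = np.array([2, 4, 3, 7, 6])

coef = polynomial.polyfit(x, y, 1)           # coef[0] = intercept, coef[1] = slope
x_line = np.linspace(x.min(), x.max(), 50)   # grid of x values for the line
y_line = polynomial.polyval(x_line, coef)    # fitted line evaluated on the grid

print('intercept = {:.4f}, slope = {:.4f}'.format(coef[0], coef[1]))
# with matplotlib, plt.plot(x_line, y_line) would draw the line over the scatter plot
```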
Example 4: Read the Iris data from an Excel file, then draw a scatter plot and a bar plot using the data and the column names.
Note :
import pandas as pd
import matplotlib.pyplot as plt
file_dir = '../Data/'
df = pd.read_excel(file_dir + 'Iris.xls', \
index_col = None, header = 0)
print(df) # the contents
df.info() # information for columns (info() prints its own output)
print(df.describe()) # descriptive stats
fig, (ax1, ax2) = plt.subplots(
1, 2, figsize=(8, 3))
ax1.scatter(df['Sepal Length'], df['Sepal Width'], \
s = 150, c = 'r', alpha = 0.5)
ax1.set_xlabel(df.columns[0])
ax1.set_ylabel(df.columns[1])
fig.suptitle('The Iris Data') # the super title
df.mean(numeric_only = True).plot.bar(ax = ax2, ylabel = 'Mean Value') # skip any non-numeric column
plt.show()
Exercise: Use the pandas module to load the Excel file TaiwanBank and draw a line chart of its contents, following the steps outlined below.
Note:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.font_manager as mfm
...
# read the Excel file
# draw the line chart
# prepare the legend
# rotate the x-tick labels
...
# the following code sets a font that can display Chinese text in the legend
font_path = r"C:\WINDOWS\FONTS\MSJHL.TTC" # 微軟正黑體 (Microsoft JhengHei); raw string keeps the backslashes literal
prop = mfm.FontProperties(fname = font_path)
plt.legend(prop = prop)
Reference: The following code is copied from the blog “Data Viz with Python and R”. It demonstrates the power of pandas and a more advanced technique for drawing a scatter plot with a legend.
Note:
import pandas as pd
import matplotlib.pyplot as plt
penguins_data="https://raw.githubusercontent.com/datavizpyr/data/master/palmer_penguin_species.tsv"
# load penguins data with pandas read_csv
df = pd.read_csv(penguins_data, sep="\t")
df = df.dropna() # drop NA data (missing data)
print(df.head()) # print out the first few data
plt.figure(figsize=(8,6))
sp_names = ['Adelie', 'Chinstrap', 'Gentoo'] # alphabetical, to match the order of the category codes
scatter = plt.scatter(df.culmen_length_mm, \
df.culmen_depth_mm, alpha = 0.5, s = 150, \
c = df.species.astype('category').cat.codes)
plt.xlabel("Culmen Length", size=14)
plt.ylabel("Culmen Depth", size=14)
# add legend to the plot with names
plt.legend(handles = scatter.legend_elements()[0],
labels = sp_names,
title = "species")
plt.show()
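The colours in this scatter plot come from the integer codes that pandas assigns to each category. For string data those codes follow the sorted order of the category names, not the order of appearance, and the legend handles are produced in code order, so the legend labels need to be listed in that same order. A minimal sketch with a made-up `species` column:

```python
import pandas as pd

# hypothetical species column, deliberately not in alphabetical order
species = pd.Series(['Gentoo', 'Adelie', 'Chinstrap', 'Adelie'])
cat = species.astype('category')

print(list(cat.cat.categories))   # category names in sorted order
print(list(cat.cat.codes))        # integer code of each row
```

Passing `cat.cat.categories` directly as the legend labels is one way to avoid maintaining a separate list of names by hand.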
Example 5: Use the penguin_species data from the previous example. For each category (species), calculate the means of culmen_length and culmen_depth and their covariance matrix.
Note :
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
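# ... code from the previous example (place this loop before plt.show()) ...
# compute the means and the covariance matrix of each of the three categories
for sp in sp_names:
    cul_len = df.culmen_length_mm[df.species == sp]
    cul_dep = df.culmen_depth_mm[df.species == sp]
    # mark the group mean on the scatter plot
    plt.text(cul_len.mean(), cul_dep.mean(), 'X', color = 'r')
    X = np.stack((cul_len, cul_dep), axis = 0) # 2 x n array, one variable per row
    cov = np.cov(X) # 2 x 2 sample covariance matrix (divides by n-1)
    print('The covariance matrix of {}\n'.format(sp), cov)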
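The same per-group statistics can also be obtained without an explicit loop by using pandas `groupby`. The sketch below is an alternative to the loop, not a replacement for it. It runs on a small made-up table (`toy`, with group labels 'A' and 'B') rather than the penguin data, so the numbers are illustrative only.

```python
import pandas as pd

# small made-up table standing in for the penguin data
toy = pd.DataFrame({
    'species': ['A', 'A', 'A', 'B', 'B', 'B'],
    'culmen_length_mm': [39.0, 40.0, 41.0, 46.0, 48.0, 50.0],
    'culmen_depth_mm': [18.0, 19.0, 17.0, 14.0, 15.0, 16.0],
})
cols = ['culmen_length_mm', 'culmen_depth_mm']

means = toy.groupby('species')[cols].mean()   # one row of means per group
covs = toy.groupby('species')[cols].cov()     # stacked 2 x 2 matrices, divides by n-1
print(means)
print(covs.loc['A'])                          # covariance matrix of group 'A'
```

Changing the grouping key (for example to a column such as island or sex, if present) gives the corresponding statistics for a different categorization.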
Exercise: The penguin_species data can be categorized by “species”, “island” and “sex”. Repeat the steps of the previous examples (draw the scatter plot, then compute the means and covariance matrices) with the data grouped by “island” and then by “sex”.