# 汪群超 Chun-Chao Wang

Dept. of Statistics, National Taipei University, Taiwan

# Lesson 3: Basic Math and Statistics Computations

### Objective :

1. Learn basic computation techniques by calculating some descriptive statistics in various ways.
2. Learn how to import data from .txt file
4. Process data by Pandas and Scipy modules
5. Scatter plots and its variants.
6. From scatter plot to polynomial fitted line

### Prerequisite : $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i, \;\; \bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$ $\bar{\sigma}_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2}, \;\; \bar{\sigma}_y = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (y_i-\bar{y})^2}$

Note:

1. Use numpy ackage
2. Show different ways of computing the same statistics including the commands from package
3. Get familiar with the object method and module method, e.g. x.sum() vs. np.sum(x)
4. Learn two ways of inserting number in a string to print out.
import numpy as np

x = np.array([6, 3, 5, 8, 7])
y = np.array([2, 4, 3, 7, 6])

sum_x_formula = x + x + x + x + x
sum_x = x.sum() # numpy data's method

n = len(x) # sample size
x_bar_formula = sum_x / n
x_bar = x.mean()

var_x_formula_1 = ((x - x_bar)**2).sum() / (n-1) # element by element operation
var_x_formula_2 = np.sum((x - x_bar)**2) / (n-1)
var_x = x.var( ddof = 1) # ddof = 1: divided by N-1

std_x_formula = np.sqrt(var_x)
std_x = x.std( ddof = 1)

y_bar = y.mean()
std_y = y.std( ddof = 1)

print('The sample mean of x is {:.4f}'.format(x_bar))
print('The sample standard deviation of x is %.6f' %std_x) $z_i = \frac{x_i - \bar{x}}{\bar{\sigma}_x}$ for $i=1,2,\cdots, 5$

Note:

1. Compute $z = [z_1\;\; z_2 \;\; z_3 \;\; z_4 \;\; z_5]$ by formula.
2. Compare the result of applying command “zscore” from the scipy module (scipy.stats.zscore)
import numpy as np
from scipy.stats import zscore

x = np.array([6, 3, 5, 8, 7])
z = zscore(x, ddof = 1)
z_formula = .... do it yourself

# To compare two vectors
print(np.c_[z, z_formula]) # concatenate vertically $r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}$ where $S_{xy} = \sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})$ $S_{xx} = \sum_{i=1}^n (x_i-\bar{x})^2, \;\; S_{yy} = \sum_{i=1}^n (y_i-\bar{y})^2$

Note:

1. First, practice to code the formula for the correlation coefficient $r$.
2. Scipy provides the command “pearsonr” for $r$. Refer to document for the details, especially the return values.
3. Compare the results from (1) and (2) to double check the correctness of coding. On the other hand, to learn how to use a new command.
import numpy as np
from scipy.stats import pearsonr

x = np.array([6, 3, 5, 8, 7])
y = np.array([2, 4, 3, 7, 6])

# coding for r by formula
..............
r_formula = .....
print('Correlation coefficient by formula is {:.4f}'.format(r_formula))
# directly use command from scipy.stats
r_sci = pearsonr(x, y) # the first return value
print('Correlation coefficient by scipy.stats.pearsonr is %.4f' %r_sci)



Note:

1. This example uses matplotlib module for scatter plot.
2. Refer to the document for more attributes to enrich its presentation, e.g. the marker shape.
3. It is necessary to re-arrange the axis range for a better look.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([6, 3, 5, 8, 7])
y = np.array([2, 4, 3, 7, 6])

size, color = 50, 'g'
plt.scatter(x, y, s = size, c = color)

# arrange axis range for better look
plt.axis([0, 10, 0,10])
plt.xlabel('x'), plt.ylabel('y')
plt.show()


Note :
2. Normally, data file is stored in a folder different from the current code. In this example, we use a relative path for its directory.
3. Use Numpy’s “loadtxt” command to load a txt file data1.txt. Note that the attribute comments=’%’ is used to prevent the comment lines from being loaded. Check the data file to see the comment line in there.
4. Another important attribute in “loadtxt” is “delimiter”. To correctly set up the necessary attributes, you need to open data file in advance.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

data_dir = '../Data/'
x, y = D[:, 0], D[:, 1]
r = pearsonr(x, y)

plt.scatter(x, y, s = 50, c = 'b', alpha = 0.5)

plt.axis([x.min()-1, x.max()+1, y.min()-1, y.max()+1])
plt.title('r = %.4f' %r)
plt.xlabel('x'), plt.ylabel('y')
plt.show()


Note :

1. The Numpy function numpy.polynomial.polynomial.polyfit() provides least-squares fit of a polynomial to data. The return contains the coefficients of the fitted polynomial function.
2. Try to add the following codes in the previous codes to give the graph on the right hand side.
3. Practice to load each data file in the demo data. Watch closely the distribution of data and the corresponding correlation coefficient.
from numpy.polynomial import polynomial

.... previous codes here

coef = polynomial.polyfit(x, y, 1)

....



Note :

2. Learn to use Pandas module which is strong in dealing with EXCEL data.
3. Pandas can read EXCEL data, do some computations and draw various type of graphs (Refer to document).
4. Learn how to make a subplot and its details.
5. Practice to draw a scatter plot for column 3 and 4 with associated xlabel and ylabel.
import pandas as pd
import matplotlib.pyplot as plt

file_dir = '../Data/'
df = pd.read_excel(file_dir + 'Iris.xls', \
index_col = None, header = 0)

print(df) # the contents
print(df.info()) # information for columns
print(df.describe()) # descriptive stats

fig, (ax1, ax2) = plt.subplots(
1, 2, figsize=(8, 3))
ax1.scatter(df['Sepal Length'], df['Sepal Width'], \
s = 150, c = 'r', alpha = 0.5)
ax1.set_xlabel(df.columns)
ax1.set_ylabel(df.columns)
fig.suptitle('The Iris Data') # the super title
df.mean().plot.bar(ax = ax2, ylabel = 'Mean Value')
plt.show()


Note:

1. Try to investigate more about pandas.read_excel and its return data franme.
2. Learn how to change font of the text shown the axes and the rotation of xtick.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.font_manager as mfm

...
draw the line chart
prepare for the legend
rotate the xtick
...

# the following codes will change the font of text shown on the axes
font_path = "C:\WINDOWS\FONTS\MSJHL.TTC" # 微軟正黑體
prop = mfm.FontProperties(fname = font_path)
plt.legend(prop = prop)


Note:

1. Pandas can receive data from URL.
2. Pandas can take care of missing data (NA).
3. Matplotlib’s scatters data by its categories (associated with colors)
4. Demonstrate the legend for scatter plot.
5. Try to store data to an EXCEL file (pandas.to_excel)
import pandas as pd
import matplotlib.pyplot as plt

penguins_data="https://raw.githubusercontent.com/datavizpyr/data/master/palmer_penguin_species.tsv"
df = df.dropna() # drop NA data (missing data)
print(df.head()) # print out the first few data

plt.figure(figsize=(8,6))
scatter = plt.scatter(df.culmen_length_mm, \
df.culmen_depth_mm, alpha = 0.5, s = 150, \
c = df.species.astype('category').cat.codes)
plt.xlabel("Culmen Length", size=14)
plt.ylabel("Culmen Depth", size=14)
# add legend to the plot with names
plt.legend(handles = scatter.legend_elements(),
labels = sp_names,
title = "species")
plt.show()


Note :

1. Learn how to extract desired data from a data frame to compute the requested statistics, such as mean and covariance.
2. In the demonstration codes below, the mean pair of (culmen_length, culmen_depth) for each category are marked as ‘X’ on the scatter plot.
3. The covariance matrix of (culmen_length, culmen_depth) for each category is printed out.
4. Examine the covariance matrix and see whether the results are correct or not (check with the distribution for each category in the scatter plot)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

... codes in previous example ...

# compute the means of the three categories
for sp in sp_names :
cul_len = df.culmen_length_mm[df.species == sp]
cul_dep = df.culmen_depth_mm[df.species == sp]
plt.text(cul_len.mean(), cul_dep.mean(), 'X', color = 'r')
X = np.stack((cul_len, cul_dep), axis = 0)
cov = np.cov(X)
print('The covariance matrix of {}\n'.format(sp), cov)