汪群超 Chun-Chao Wang

Dept. of Statistics, National Taipei University, Taiwan

Lesson 3: Basic Math and Statistics Computations

Objective :

  1. Learn basic computation techniques by calculating some descriptive statistics in various ways.
  2. Learn how to import data from .txt file
  3. Download data from website
  4. Process data by Pandas and Scipy modules
  5. Scatter plots and its variants.
  6. From scatter plot to polynomial fitted line

Prerequisite :


範例 1: Given x = [6\;\; 3 \;\; 5 \;\; 8 \;\; 7] and y = [2\;\; 4 \;\; 3 \;\; 7 \;\; 6] , compute \bar{x}, \bar{y} and \bar{\sigma}_x, \bar{\sigma}_y where

\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i, \;\; \bar{y} = \frac{1}{n}\sum_{i=1}^n y_i

\bar{\sigma}_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2}, \;\; \bar{\sigma}_y = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (y_i-\bar{y})^2}

Note:

  1. Use numpy ackage
  2. Show different ways of computing the same statistics including the commands from package
  3. Get familiar with the object method and module method, e.g. x.sum() vs. np.sum(x)
  4. Learn two ways of inserting number in a string to print out.
import numpy as np

x = np.array([6, 3, 5, 8, 7])
y = np.array([2, 4, 3, 7, 6])

sum_x_formula = x[0] + x[1] + x[2] + x[3] + x[4]
sum_x = x.sum() # numpy data's method

n = len(x) # sample size
x_bar_formula = sum_x / n 
x_bar = x.mean()

var_x_formula_1 = ((x - x_bar)**2).sum() / (n-1) # element by element operation
var_x_formula_2 = np.sum((x - x_bar)**2) / (n-1)
var_x = x.var( ddof = 1) # ddof = 1: divided by N-1

std_x_formula = np.sqrt(var_x)
std_x = x.std( ddof = 1)

y_bar = y.mean()
std_y = y.std( ddof = 1)

print('The sample mean of x is {:.4f}'.format(x_bar))
print('The sample standard deviation of x is %.6f' %std_x)

練習: Compute the zscore for the sample x = [6\;\; 3 \;\; 5 \;\; 8 \;\; 7] where

z_i = \frac{x_i - \bar{x}}{\bar{\sigma}_x} for i=1,2,\cdots, 5

Note:

  1. Compute z = [z_1\;\; z_2 \;\; z_3 \;\; z_4 \;\; z_5] by formula.
  2. Compare the result of applying command “zscore” from the scipy module (scipy.stats.zscore)
import numpy as np
from scipy.stats import zscore

x = np.array([6, 3, 5, 8, 7])
z = zscore(x, ddof = 1)
z_formula = .... do it yourself

# To compare two vectors
print(np.c_[z, z_formula]) # concatenate vertically

範例 2: Given x = [6\;\; 3 \;\; 5 \;\; 8 \;\; 7] and y = [2\;\; 4 \;\; 3 \;\; 7 \;\; 6] , compute the correlation coefficient

r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} where

S_{xy} = \sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})

S_{xx} = \sum_{i=1}^n (x_i-\bar{x})^2, \;\; S_{yy} = \sum_{i=1}^n (y_i-\bar{y})^2

Note:

  1. First, practice to code the formula for the correlation coefficient r.
  2. Scipy provides the command “pearsonr” for r. Refer to document for the details, especially the return values.
  3. Compare the results from (1) and (2) to double check the correctness of coding. On the other hand, to learn how to use a new command.
import numpy as np
from scipy.stats import pearsonr

x = np.array([6, 3, 5, 8, 7])
y = np.array([2, 4, 3, 7, 6])

# coding for r by formula
..............
r_formula = .....
print('Correlation coefficient by formula is {:.4f}'.format(r_formula))
# directly use command from scipy.stats
r_sci = pearsonr(x, y)[0] # the first return value
print('Correlation coefficient by scipy.stats.pearsonr is %.4f' %r_sci)


範例 2: Draw the scatter plot of two samples x, y. Use data from previous examples.

Note:

  1. This example uses matplotlib module for scatter plot.
  2. Refer to the document for more attributes to enrich its presentation, e.g. the marker shape.
  3. It is necessary to re-arrange the axis range for a better look.
import numpy as np
import matplotlib.pyplot as plt 

x = np.array([6, 3, 5, 8, 7])
y = np.array([2, 4, 3, 7, 6])

size, color = 50, 'g'
plt.scatter(x, y, s = size, c = color)

# arrange axis range for better look
plt.axis([0, 10, 0,10])
plt.xlabel('x'), plt.ylabel('y')
plt.show()


範例 3: Load data from “txt” file; draw the scatter plot for the samples; compute the correlation coefficient r as well.



Note :
  1. Download the demo data here to work with this example.
  2. Normally, data file is stored in a folder different from the current code. In this example, we use a relative path for its directory.
  3. Use Numpy’s “loadtxt” command to load a txt file data1.txt. Note that the attribute comments=’%’ is used to prevent the comment lines from being loaded. Check the data file to see the comment line in there.
  4. Another important attribute in “loadtxt” is “delimiter”. To correctly set up the necessary attributes, you need to open data file in advance.
import numpy as np
import matplotlib.pyplot as plt 
from scipy.stats import pearsonr

data_dir = '../Data/'
D = np.loadtxt(data_dir + 'data1.txt', comments='%')
x, y = D[:, 0], D[:, 1] 
r = pearsonr(x, y)[0] 

plt.scatter(x, y, s = 50, c = 'b', alpha = 0.5)

plt.axis([x.min()-1, x.max()+1, y.min()-1, y.max()+1])
plt.title('r = %.4f' %r)
plt.xlabel('x'), plt.ylabel('y')
plt.show()

練習: In the previous example, add a linear polynomial function to fit the data in the scatter plot.

Note :

  1. The Numpy function numpy.polynomial.polynomial.polyfit() provides least-squares fit of a polynomial to data. The return contains the coefficients of the fitted polynomial function.
  2. Try to add the following codes in the previous codes to give the graph on the right hand side.
  3. Practice to load each data file in the demo data. Watch closely the distribution of data and the corresponding correlation coefficient.
from numpy.polynomial import polynomial

.... previous codes here

coef = polynomial.polyfit(x, y, 1)

.... add your codes below 
....


範例 4: Read Iris data from an EXCEL file and draw a scatter plot and a bar plot from the data and column names.

Note :

  1. Use “Iris.xls” as an example. Download here
  2. Learn to use Pandas module which is strong in dealing with EXCEL data.
  3. Pandas can read EXCEL data, do some computations and draw various type of graphs (Refer to document).
  4. Learn how to make a subplot and its details.
  5. Practice to draw a scatter plot for column 3 and 4 with associated xlabel and ylabel.
import pandas as pd
import matplotlib.pyplot as plt

file_dir = '../Data/'
df = pd.read_excel(file_dir + 'Iris.xls', \
     index_col = None, header = 0)

print(df) # the contents
print(df.info()) # information for columns 
print(df.describe()) # descriptive stats

fig, (ax1, ax2) = plt.subplots(
    1, 2, figsize=(8, 3))
ax1.scatter(df['Sepal Length'], df['Sepal Width'], \
     s = 150, c = 'r', alpha = 0.5)
ax1.set_xlabel(df.columns[0])   
ax1.set_ylabel(df.columns[1])  
fig.suptitle('The Iris Data') # the super title
df.mean().plot.bar(ax = ax2, ylabel = 'Mean Value')
plt.show()

練習: Use pandas module to download the EXCEL file TaiwanBank and draw a plot below.

Note:

  1. Try to investigate more about pandas.read_excel and its return data franme.
  2. Learn how to change font of the text shown the axes and the rotation of xtick.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.font_manager as mfm

...
Read EXCEL file 
draw the line chart 
prepare for the legend
rotate the xtick
...

# the following codes will change the font of text shown on the axes
font_path = "C:\WINDOWS\FONTS\MSJHL.TTC" # 微軟正黑體
prop = mfm.FontProperties(fname = font_path)
plt.legend(prop = prop)

參考: The following codes are copied from the blog “Data Viz with Python and R” that demonstrates the power of Pandas and the advanced technique for the associated scatter plot.

Note:

  1. Pandas can receive data from URL.
  2. Pandas can take care of missing data (NA).
  3. Matplotlib’s scatters data by its categories (associated with colors)
  4. Demonstrate the legend for scatter plot.
  5. Try to store data to an EXCEL file (pandas.to_excel)
import pandas as pd
import matplotlib.pyplot as plt

penguins_data="https://raw.githubusercontent.com/datavizpyr/data/master/palmer_penguin_species.tsv"
# load penguns data with Pandas read_csv
df = pd.read_csv(penguins_data, sep="\t")
df = df.dropna() # drop NA data (missing data)
print(df.head()) # print out the first few data

plt.figure(figsize=(8,6))
sp_names = ['Adelie', 'Gentoo', 'Chinstrap']
scatter = plt.scatter(df.culmen_length_mm, \
    df.culmen_depth_mm, alpha = 0.5, s = 150, \
        c = df.species.astype('category').cat.codes)
plt.xlabel("Culmen Length", size=14)
plt.ylabel("Culmen Depth", size=14)
# add legend to the plot with names
plt.legend(handles = scatter.legend_elements()[0], 
           labels = sp_names,
           title = "species")
plt.show()

範例 5: Use the penguin_species data in the previous example. Calculate the means of culmen_length and culmen_depth, and their covariance matrix for each category (specie).

Note :

  1. Learn how to extract desired data from a data frame to compute the requested statistics, such as mean and covariance.
  2. In the demonstration codes below, the mean pair of (culmen_length, culmen_depth) for each category are marked as ‘X’ on the scatter plot.
  3. The covariance matrix of (culmen_length, culmen_depth) for each category is printed out.
  4. Examine the covariance matrix and see whether the results are correct or not (check with the distribution for each category in the scatter plot)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

... codes in previous example ...

# compute the means of the three categories
for sp in sp_names :
    cul_len = df.culmen_length_mm[df.species == sp]
    cul_dep = df.culmen_depth_mm[df.species == sp]
    plt.text(cul_len.mean(), cul_dep.mean(), 'X', color = 'r')
    X = np.stack((cul_len, cul_dep), axis = 0)
    cov = np.cov(X)
    print('The covariance matrix of {}\n'.format(sp), cov)



練習: The penguin_species data can be categorized by “species”, “island” and “sex”. Try to do the same things as in the previous examples, i.e. draw the scatter plot, compute the means and covariance matrices for the category by “island” and by “sex”.