汪群超 Chun-Chao Wang

Dept. of Statistics, National Taipei University, Taiwan

Lesson 6: Random numbers and the associated graphs

Objective:

  1. Learn how to generate random numbers (sample) from some famous discrete and continuous distributions.
  2. Learn different ways of generating random numbers.
  3. Learn the visual presentations of a random sample including histogram, boxplot, probability plot and empirical CDF.

Prerequisite:


範例 1:Histogram

Draw histogram of a random sample drawn from a normal distribution N(\mu, \sigma^2).

Note:

  1. This example demonstrate the how the numpy method and scipy function generate random sample.
  2. Has to know about “random seed”.
  3. The histogram can be created by counts or by frequency as shown below.
  4. Try on various sizes of random samples to see the differences on the shape of the histograms.
  5. The number of bins may alter the visual outlook of the histogram, that may lead to a wrong decision. Try to alter the number of bins especially for the middle size of sample.
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

# random number generator with random seed
rng = np.random.default_rng() # (seed=...)
x = rng.normal(loc = 0, scale = 1, size = 1000)
# x = norm.rvs(loc = 0, scale = 1, size = 1000)

fig, (ax1, ax2) = plt.subplots(2, 1)
# typical histogram by counts
ax1.hist(x, bins = 10, alpha = 0.6, color = 'b', edgecolor = 'y', linewidth = 1)
# histogram by frequency 
ax2.hist(x, bins = 10, alpha = 0.6, color = 'b', edgecolor = 'y', linewidth = 1, density = True, rwidth = 0.9)
plt.suptitle('Histogram by counts and frequency')
plt.show()


範例 2:Boxplot

Draw boxplots of two random samples drawn from two different normal distribution of N(0, 1) and N(1, 2)

Note:

  1. This example demonstrates the generation of random number with and without random seed. The reader needs to print out a few random numbers and observe the difference between using and not using random seed.
  2. The display of Boxplot can be very rich and colorful. The codes below show a couple: the dictionary of “boxprops” and “flierprops”.
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

rng = np.random.default_rng(0) # seed is fixed
n = 1000 # sampe size
x1 = norm.rvs(size = n, random_state = rng) # standard normal
x2 = norm.rvs(loc = 1, scale = 2, size = n)
# run a few times and check the random numbers generated 
# with and without fixing the random seed.
# print(x1[0:3])
# print(x2[0:3])
fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (9, 6))
ax1.boxplot(np.c_[x1, x2], notch = True, vert = True, labels = ['X1','X2'])
# plt.show()

# use boxplot properties
boxprops = dict(linestyle = '--', linewidth = 3, color = 'darkgoldenrod')
flierprops = dict(marker='o', markerfacecolor = 'green', 
    markersize = 8, linestyle = 'none') # define outliers
labels = ['N(0,1)','N(1,1)']
ax2.boxplot(np.c_[x1, x2], boxprops = boxprops,
    flierprops = flierprops, labels = labels)
plt.show()


範例 3:Probabilty plot

Draw a normal probability plot for random sample drawn from normal distribution and a right-skewed \chi^2 distribution, respectively.

Note:

  • A probability plot can be set for various distribution. When set for normal distribution, the plot is called “normal probability plot” that can be used to test whether a random sample is drawn form a normal distribution.
  • When set for distribution other than normal, it is necessary to add “sparams” for the necessary parameters. The example for \chi^2 distribution with degree of freedom 2 is shown (commented) below.
  • import numpy as np
    import scipy.stats as stats
    
    x1 = np.random.normal(loc = 2, scale = 5, size=100) 
    x2 = np.random.chisquare(df = 2, size = 100)
    
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (12, 4))
    stats.probplot(x1, dist = "norm", plot = ax1)
    stats.probplot(x2, dist = "norm", plot = ax2)
    # stats.probplot(x2, dist = "chi2", sparams = 2, plot = ax2)
    ax1.set_title('Data from Normal')
    ax2.set_title('Data from $\chi^2$')
    plt.suptitle('probability Plot')
    plt.show()
    

    範例 4:Empirical CDF

    Draw empirical CDF (ECDF) of a random sample drawn by a given distribution e.g. N(0,1). Then compare the ECDF with the real CDF and observe the difference when sample size is getting large.

    Note:

    1. The empirical CDF simulates the CDF of the population by random sample of size n.
    2. The scipy module provides function “cumfreq” (cumulative frequency over sorted sample) for ECDF. This is called “approach 1” in the following codes.
    3. The easiest method to simulate the CDF from the sample is to treat the probability of occurrence for each sample point is \frac{1}{n}. This method is called “approach 2” in the following codes.
    4. To see how ECDF resemble the real population CDF, the empirical-CDF and real-CDF are plotted on the same figure. The reader should change the sample size to get an idea of the resemblance according to sample size.
    import numpy as np
    from scipy.stats import norm
    from scipy.stats import cumfreq  # for ECDF
    
    n = 100
    sample = norm.rvs(size = n)
    # Approach 1
    num_bins = n # cumulative frequency over each sample
    res = cumfreq(sample, num_bins)
    # print(res.lowerlimit, res.binsize, res.cumcount.size)
    x = res.lowerlimit + np.linspace(0,
        res.binsize * res.cumcount.size, res.cumcount.size)
    
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize = (6, 9))
    # ax1.bar(x, res.cumcount/n, width=res.binsize)
    ax1.plot(x, res.cumcount/n, drawstyle = 'steps-pre')
    ax1.set_title('The Empirical CDF with {} steps'.format(num_bins), fontsize = 'small')
    ax1.grid(True)
    
    # Approach 2
    x = np.sort(sample)
    Y = np.arange(1, n+1) / n
    ax2.plot(x, Y, drawstyle = 'steps-pre', label = 'Empirical CDF')
    # draw the theoretical CDF of Z
    Y_ = norm.cdf(x)
    ax2.plot(x, Y_, color = 'r', linestyle = '--',
        linewidth = 5, alpha = 0.5, label = 'real CDF')
    ax2.legend(), plt.grid(True)
    ax2.set_title('The Empirical CDF with {} samples'.format(num_bins), fontsize = 'small')
    plt.show()
    
    

    練習:Sampling Distribution Y = X^2

    Let the random variable X be normally distributed with N(0, 1). If Y = X^2, the random variable Y is known to follow \chi^2 distribution with degree of freedom 1. Verify this well known theorem by generating random samples of size n from N(0, 1), squaring them and then drawing histogram, boxplot, probability plot and ECDF.

    Note:

    1. To get a good view of the histogram, it needs to adjust the bin numbers. When attached by the real PDF line, it needs to select appropriate ylim.

    練習:Sampling Distribution Y = X_1 + X_2, where X_1 \sim  \chi^2(2), X_2 \sim \chi^2(4)

    Use simulation data to verify that Y \sim \chi^2(6). Select an appropriate graph for the verification.

    Note:

    1. Simulate the sampling distribution for the sample size of n=30, 100, 500, 1000, respectively.

    練習:Sampling Distribution for Central Limit Theorem (CLT)

    The experiment:

    • Let X_1, X_2, \cdots, X_n \sim F(\mu, \sigma)
    • Let Y = (X_1 + X_2 + \cdots + X_n) / n
    • As n \longrightarrow \infty, \;\;Y \sim N(\mu, \sigma/\sqrt{n})
    The scenarios:
    • Let F be uniform, t(3), t(10), \chi^2(2), \beta(9,2)
    • Let n=5, 10, 30, 50, 100, 300, 500, 1000.
    The outcome:(for each F and n)
    1. The histogram of a random sample (size N = 10000) of Y and the expected pdf line of N(\mu, \sigma/\sqrt{n}). For example, the graph shown on the left hand side below comes from F=\chi^2(2), n=100, and N=10000.
    2. Compute the sample mean \hat{\mu} and sample standard deviation \hat{\sigma} of N random samples of Y, where N=10000. The graph on the right hand side below shows the comparison of \hat{\mu}, \hat{\sigma} with the theoretical counterpart.
    3. The reader needs patience to work out the details including the appropriate Y range, the xticklabels, etc.

    範例 5:Boxplot for multiple data sets

    Draw N boxplots in a figure for N data sets. In this example, the N data sets are drawn form N independent variables distributed normally. Each data set contains n data points.

    Note:

    1. Boxplot presents the descriptive statistics of a data set visually.
    2. It is often used to compare among multiple data sets.
    import matplotlib.pyplot as plt
    from scipy.stats import norm
    
    n, N = 20, 100
    X = norm.rvs(size=(N, n))
    plt.boxplot(X)
    plt.title('The boxplots of {} independent random samples from $N(0,1)$'.format(n))
    plt.show()
    
    

    習題:

    1. Choose three sampling distributions you have learned. Verify the theorems by simulations.