Bootstrap Resampling: Principles, Applications, and a Python Implementation



Concept

Like cross-validation, the bootstrap is a resampling method. It is used to approximate the sampling distribution of an estimator.

Advantages

It helps estimate the variance of an estimator ①

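Suppose $\theta = T(F)$ (where $T$ is a functional of the distribution $F$) has an estimator $\hat{\theta} = \hat{\theta}(X_1, \ldots, X_n)$. In general, to assess the accuracy of $\hat{\theta}$ we need its mean squared error (MSE):

$$\mathrm{MSE}(\hat{\theta}) = E_F\big[(\hat{\theta} - \theta)^2\big] = \big(E_F[\hat{\theta}] - \theta\big)^2 + \mathrm{Var}_F(\hat{\theta})$$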

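When the sample size $n$ is large, the empirical distribution function $\hat{F}_n$ approaches the true distribution $F$, so the natural estimator of $\theta$ is $T(\hat{F}_n)$. The first part of the expression above can therefore be written as:

[equation missing in the source]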

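Next we compute the second part, the variance of $\hat{\theta}$. Replacing $F$ by $\hat{F}_n$ in the definition of the variance gives:

$$\mathrm{Var}_{\hat{F}_n}(\hat{\theta}) = \frac{1}{n^n}\sum_{(x_1^*,\ldots,x_n^*)}\left(\hat{\theta}(x_1^*,\ldots,x_n^*) - \frac{1}{n^n}\sum_{(y_1^*,\ldots,y_n^*)}\hat{\theta}(y_1^*,\ldots,y_n^*)\right)^2$$

where both sums run over every ordered selection of $n$ values from $\{X_1, \ldots, X_n\}$.

Because the sum above has $n^n$ terms, evaluating it directly is impractical, even for a small sample size.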
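To make the cost concrete, here is a minimal sketch that enumerates every ordered resample for a tiny data set and computes the variance of the statistic under the empirical distribution exactly. The function name `exact_bootstrap_variance` and the three data values are illustrative assumptions, not part of the original article; the approach only works for very small $n$.

```python
import itertools

import numpy as np


def exact_bootstrap_variance(data, statistic=np.mean):
    """Variance of `statistic` under the empirical distribution, by full enumeration.

    Hypothetical helper for illustration: each of the n**n ordered resamples
    (drawn with replacement) is equally likely, so the exact variance is the
    population variance of the statistic over all of them.
    """
    n = len(data)
    values = np.array([statistic(resample)
                       for resample in itertools.product(data, repeat=n)])
    return values.var(), len(values)


data = [1.0, 2.0, 4.0]  # made-up data, n = 3
variance, n_terms = exact_bootstrap_variance(data)
print(n_terms)   # number of enumerated resamples, 3**3
print(variance)  # exact variance of the resampled mean
```

With $n = 3$ the enumeration is trivial, but $n = 10$ would already require $10^{10}$ evaluations of the statistic, which is why the sum is replaced by random resampling in practice.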

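Note that $\hat{F}_n$ places probability mass $1/n$ on each of the data points $X_1, \ldots, X_n$, so drawing an observation from $\hat{F}_n$ is the same as drawing a value at random from the original data.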

Notice that $\hat{F}_n$ puts mass $1/n$ at each data point $X_1, \ldots, X_n$. Therefore, drawing an observation from $\hat{F}_n$ is equivalent to drawing one point at random from the original data set. ②
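A quick numerical check of this equivalence, as a sketch: sampling with replacement from the data (here via NumPy's `Generator.choice`) should hit each data point with relative frequency close to $1/n$. The four data values and the seed are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed, chosen arbitrarily
data = np.array([3.1, 0.4, 2.7, 5.0])    # made-up data set with n = 4

# Drawing from the empirical distribution = picking data points uniformly at random.
draws = rng.choice(data, size=100_000, replace=True)

# Relative frequency of each data point among the draws; each should be near 1/4.
freqs = {float(x): float(np.mean(draws == x)) for x in data}
print(freqs)
```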

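The problem above can therefore be solved by drawing $m$ random samples of size $n$ from $\hat{F}_n$, computing the estimator on each sample to obtain $\hat{\theta}^*_1, \ldots, \hat{\theta}^*_m$, and using the sample variance of these $m$ values as the estimate:

$$\widehat{\mathrm{Var}}_{boot}(\hat{\theta}) = \frac{1}{m-1}\sum_{j=1}^{m}\left(\hat{\theta}^*_j - \bar{\theta}^*\right)^2, \qquad \bar{\theta}^* = \frac{1}{m}\sum_{j=1}^{m}\hat{\theta}^*_j$$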

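These $m$ samples are called bootstrap samples or resamples. The mean $\bar{\theta}^*$ of the statistics $\hat{\theta}^*_1, \ldots, \hat{\theta}^*_m$ computed from them is called the bootstrap mean, and the bootstrap standard error (denoted $\widehat{se}_{boot}$ below) is:

$$\widehat{se}_{boot} = \sqrt{\frac{1}{m-1}\sum_{j=1}^{m}\left(\hat{\theta}^*_j - \bar{\theta}^*\right)^2}$$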

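Applications and Procedure

In summary, the bootstrap is commonly used to obtain:

the standard error of a statistic

confidence intervals for unknown parameters

p-values for hypothesis tests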

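Procedure

1. Draw $n$ observations with replacement from the original sample;

2. Compute the statistic on the sample from step 1;

3. Repeat steps 1 and 2 $m$ times to obtain the bootstrap statistics, then compute their sample variance or standard error.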

The distribution of the statistics computed from the bootstrap samples in step 3 is called the bootstrap distribution. It gives information about the shape, center, and spread of the sampling distribution of the statistic.
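The three steps above can be sketched as a small reusable function. This is a minimal illustration under stated assumptions, not the article's own code: the name `bootstrap_distribution`, the default `m=2000`, the seeds, and the use of the median as the statistic are all choices made here for the example.

```python
import numpy as np


def bootstrap_distribution(sample, statistic, m=2000, seed=0):
    """Return m bootstrap replications of `statistic` (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)
    n = sample.size
    stats = np.empty(m)
    for j in range(m):
        resample = rng.choice(sample, size=n, replace=True)  # step 1: resample with replacement
        stats[j] = statistic(resample)                        # step 2: compute the statistic
    return stats                                              # step 3: m replications


# Example: an initial sample of 36 draws from N(2, 2^2), statistic = median.
rng = np.random.default_rng(42)
sample = rng.normal(loc=2.0, scale=2.0, size=36)

boot = bootstrap_distribution(sample, np.median, m=2000)
boot_mean = boot.mean()        # bootstrap mean
boot_se = boot.std(ddof=1)     # bootstrap standard error (sample standard deviation)
print(boot_mean, boot_se)
```

Passing the statistic as a function keeps the resampling loop independent of what is being estimated, so the same sketch works for a mean, a median, or any other function of the sample.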

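Further Applications

For hypothesis testing, see ③.

Once the standard error of the statistic is known, we can go on to compute bootstrap confidence intervals. There are three main kinds (for proofs, see ②):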

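1. The Normal Interval

$$\hat{\theta} \pm z_{\alpha/2}\,\widehat{se}_{boot}$$

where $z_{\alpha/2}$ is the upper $\alpha/2$ quantile of the standard normal distribution.

*This interval is accurate only when the distribution of $\hat{\theta}$ is close to normal (for example, the sample mean).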

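2. Pivotal Intervals

Define the pivot as $R_n = \hat{\theta} - \theta$. The bootstrap pivotal confidence interval is:

$$C_n = \left(2\hat{\theta} - \hat{\theta}^*_{1-\alpha/2},\; 2\hat{\theta} - \hat{\theta}^*_{\alpha/2}\right)$$

where $\hat{\theta}^*_1, \ldots, \hat{\theta}^*_m$ are the bootstrap replications of the statistic and $\hat{\theta}^*_{\alpha/2}$ is their $\alpha/2$ quantile.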

3. Percentile Intervals

$$C_n = \left(\hat{\theta}^*_{\alpha/2},\; \hat{\theta}^*_{1-\alpha/2}\right)$$
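As a rough sketch, the three intervals can be computed from a set of bootstrap replications as follows. It assumes the standard forms: normal $\hat{\theta} \pm z_{\alpha/2}\,\widehat{se}_{boot}$, pivotal $(2\hat{\theta} - q_{1-\alpha/2},\, 2\hat{\theta} - q_{\alpha/2})$, and percentile $(q_{\alpha/2},\, q_{1-\alpha/2})$, where $q$ denotes a quantile of the replications. The function name `bootstrap_intervals`, the simulated data, and the seed are assumptions made for the example.

```python
from statistics import NormalDist

import numpy as np


def bootstrap_intervals(theta_hat, boot_stats, alpha=0.05):
    """Normal, pivotal, and percentile bootstrap intervals (hypothetical helper)."""
    boot_stats = np.asarray(boot_stats)
    se_boot = boot_stats.std(ddof=1)                     # bootstrap standard error
    z = NormalDist().inv_cdf(1 - alpha / 2)              # upper alpha/2 normal quantile
    q_lo, q_hi = np.quantile(boot_stats, [alpha / 2, 1 - alpha / 2])
    return {
        "normal": (theta_hat - z * se_boot, theta_hat + z * se_boot),
        "pivotal": (2 * theta_hat - q_hi, 2 * theta_hat - q_lo),
        "percentile": (q_lo, q_hi),
    }


# Example: 95% intervals for the mean of a simulated sample of size 36.
rng = np.random.default_rng(0)
sample = rng.normal(loc=2.0, scale=2.0, size=36)
theta_hat = sample.mean()
boot_stats = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(2000)
])

intervals = bootstrap_intervals(theta_hat, boot_stats)
for name, (lo, hi) in intervals.items():
    print(f"{name:>10}: ({lo:.3f}, {hi:.3f})")
```

For a statistic such as the sample mean, whose bootstrap distribution is roughly symmetric, the three intervals should come out close to one another; they can differ noticeably for skewed statistics.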

Python Implementation

import random

import numpy as np
import pandas as pd

population = list(np.random.normal(loc=2.0, scale=2.0, size=2000))  # generate the population

sample_times = [10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000, 500000, 1000000]
result = pd.DataFrame({  # container for the bootstrap results
    "sample_time": sample_times,
    "sample_mean": [np.nan] * len(sample_times),
    "sample_mean_std": [np.nan] * len(sample_times),
})

sampl = random.sample(population, 36)  # draw the initial sample of 36 observations

for row, n_resamples in enumerate(result["sample_time"]):
    btstp_stat = []
    for _ in range(int(n_resamples)):  # number of bootstrap resamples
        bst_sampl = np.random.choice(sampl, size=36, replace=True)  # resample with replacement
        btstp_stat.append(np.mean(bst_sampl))  # collect the bootstrap statistic
    result.loc[row, "sample_mean"] = np.mean(btstp_stat)
    result.loc[row, "sample_mean_std"] = np.std(btstp_stat, ddof=1)  # write the results into the data frame
    print(len(btstp_stat))

print(result)  # print the results

The results show that as the number of bootstrap resamples increases (from 10 to 1,000,000), the bootstrap mean and the bootstrap standard error gradually converge.