https://academic.oup.com/bioinformatics/article/26/22/2897/227791
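Given a set of observed values of a statistic, these notes describe the algorithm FastPval uses to assign a P-value to each observation. FastPval computes P-values in two stages and, for simplicity, the description below works with the right tail of the statistic's distribution.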
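In the first stage, we randomly sample a subset N from the original dataset O (to save processing time, N is usually less than one-hundredth the size of O). We sort N and find a cutoff s_c such that the scores above s_c form the top P portion of N. Both the size of N and P are set by the user; the defaults are N = 100,000 and P = 0.001, so by default the top portion holds 100,000 × 0.001 = 100 sampled scores.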
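With the cutoff in hand, we scan the dataset O again and put every score greater than s_c into a second set M, sort M as well, and record its maximum s_m. The sorted N and M are saved as the two models, M1 and M2.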
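The first stage can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not FastPval's actual implementation: the function name `build_models`, its parameters, and the choice of s_c as the k-th largest sampled score (with k = N × P) are all hypothetical, since the paper does not spell out how ties or rounding are handled.

```python
import random


def build_models(scores, n=100_000, p=0.001, seed=0):
    """Build the two sorted models from `scores`, the full dataset O.

    Hypothetical helper; names and the exact cutoff rule are assumptions.
    Returns (m1, m2, s_c, s_m).
    """
    rng = random.Random(seed)

    # Random subset N, capped at the size of O so small inputs still work.
    m1 = sorted(rng.sample(scores, min(n, len(scores))))

    # Number of sampled scores in the top-P portion of N (at least one).
    k = max(1, round(len(m1) * p))

    # Assumed cutoff rule: s_c is the k-th largest sampled score.
    s_c = m1[-k]

    # Scan all of O and keep every score strictly above the cutoff.
    m2 = sorted(x for x in scores if x > s_c)

    # Largest score seen in O; fall back to s_c if nothing exceeds it.
    s_m = m2[-1] if m2 else s_c
    return m1, m2, s_c, s_m
```

With the defaults, `m1` holds 100,000 sampled scores and `m2` holds every score in O above the cutoff, so `m2` covers the extreme right tail at full resolution while `m1` covers the rest of the distribution cheaply.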
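In the second stage, to compute the P-value of a new score s, we first compare it with s_c: if s ≤ s_c, its P-value is looked up in M1; otherwise it is looked up in M2. If s > s_m, then s lies outside the sampled score range, and we either use a theoretical distribution to compute its P-value or simply set the P-value to 0, according to the user's preference. For the two supported theoretical distributions, normal and extreme value, the parameters are estimated from N.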
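The lookup can be sketched as follows. This is a minimal illustration, not FastPval's code: the names `empirical_p` and `p_value` and the `beyond` option are hypothetical. Two things are assumptions the paper does not state explicitly: the P-value in M1 is taken as the fraction of sampled scores at or above s, and the P-value in M2 is the count of scores in M at or above s divided by the size of O (reasonable because M contains every score in O above s_c). Only the normal fallback is shown; the extreme value option is omitted.

```python
from bisect import bisect_left
from statistics import NormalDist


def empirical_p(sorted_scores, s, denom):
    """Right-tail P-value: number of scores >= s, divided by `denom`."""
    return (len(sorted_scores) - bisect_left(sorted_scores, s)) / denom


def p_value(s, m1, m2, s_c, total, beyond="zero"):
    """Look up the P-value of score `s` in the two sorted models.

    m1: sorted random sample N; m2: sorted scores of O above s_c (non-empty);
    total: size of the full dataset O (assumed denominator for m2).
    """
    s_m = m2[-1]  # maximum score observed in O
    if s <= s_c:
        # Bulk of the distribution: resolve against the sample.
        return empirical_p(m1, s, len(m1))
    if s <= s_m:
        # Right tail: resolve against every tail score from O.
        return empirical_p(m2, s, total)
    if beyond == "normal":
        # Outside the observed range: normal fit with parameters from N.
        return 1.0 - NormalDist.from_samples(m1).cdf(s)
    return 0.0  # user preference: report 0 beyond the observed range
```

For example, with `m1 = [1, 2, ..., 10]`, `s_c = 8`, `m2 = [9, 10, 12, 15]` and `total = 100`, a score of 5 is resolved in `m1` as 6/10, a score of 10 is resolved in `m2` as 3/100, and a score of 20 falls beyond `s_m` and gets either 0 or the normal-tail estimate.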
For simplicity, here we illustrate our method in a two-stage approach and use the right tail of the distribution to calculate the statistics. In the first stage, we randomly sample a subset N from the original large dataset O. N is usually less than one-hundredth of the size of O, thus saving processing time. We sort N and obtain a cutoff score Sc representing the top P portion of N. Both N and P are parameters specified by the users, and are set to N = 100 000 and P = 0.001 by default. We then scan the original set and put scores greater than Sc into our second subset M, and we obtain the maximum score Sm in M. The two subsets N and M are sorted, saved, and serve as our two models (M1 and M2). To calculate the P-value for a new score S, we compare S with Sc. If S ≤ Sc, we will find its P-value in M1. Otherwise we use M2. If S > Sm, indicating S is out of our resampling score range, we use a theoretical distribution to calculate its P-value or simply set the P-value to 0, at the user’s preference. The parameters of two theoretical distributions, normal and extreme value distributions, were obtained from dataset N.