FastPval explained

16 December 2020 in Uncategorized

https://academic.oup.com/bioinformatics/article/26/22/2897/227791

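Given a set of observed values of a statistic, this post describes the algorithm FastPval uses to assign a P-value to each observation. FastPval computes P-values in two stages and works with the right tail of the distribution of these statistics (the paper says the right tail is used "to calculate the statistics"; presumably this means the P-values are right-tailed).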

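In the first stage, we randomly sample a subset N from the original dataset O (for efficiency, N is usually less than one-hundredth of the size of O). We sort N and find a cutoff $s_c$ such that the scores greater than $s_c$ make up the top P portion of N (both N and P are set by the user; by default N = 100,000 and P = 0.001).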

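With this cutoff in hand, we scan the dataset O again and put every value greater than $s_c$ into a set $M$. We sort $M$ as well and record its maximum value $s_m$. The sorted N and M are saved as the two models, M1 and M2.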
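The first stage can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the FastPval implementation: the function name `build_models`, the `seed` argument, and the rounding used to turn P into a count `k` are all hypothetical choices, and the whole dataset O is assumed to fit in memory as a list.

```python
import random

def build_models(scores, n=100_000, p=0.001, seed=0):
    """Stage one: return (m1, m2, s_c, s_m) for a list of scores O."""
    if not 0 < p < 1 or len(scores) < 2:
        raise ValueError("need 0 < p < 1 and at least two scores")
    rng = random.Random(seed)            # fixed seed only for reproducibility
    n = min(n, len(scores))              # cannot sample more than O holds
    m1 = sorted(rng.sample(scores, n))   # M1: the sorted random subset N
    # k = how many sampled scores form the top-p portion (assumed rounding),
    # clamped so that at least one score stays below the cutoff.
    k = min(max(1, int(n * p)), n - 1)
    s_c = m1[n - k - 1]                  # scores after this position are the top portion
    # M2: every score in O strictly greater than the cutoff, sorted.
    m2 = sorted(x for x in scores if x > s_c)
    s_m = m2[-1] if m2 else s_c          # maximum of M (falls back to s_c if M is empty)
    return m1, m2, s_c, s_m

# Toy stand-in for O; real inputs would be far larger.
scores = [float(i) for i in range(1, 10_001)]
m1, m2, s_c, s_m = build_models(scores, n=1_000, p=0.01)
print(len(m1), len(m2), s_c, s_m)
```

Because M is collected from all of O rather than from the sample, it gives a much finer view of the extreme right tail than N alone, which is the reason for the second scan.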

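In the second stage, when a new statistic $s$ arrives, we compute its P-value by first comparing it with $s_c$: if $s \leq s_c$, we compute its P-value in M1; otherwise we compute it in M2. If $s > s_m$, then $s$ lies beyond the range we sampled, and we either use a theoretical distribution to compute its P-value or simply set the P-value to 0 (depending on the user's preference; if a normal or extreme value distribution is used, its parameters are estimated from dataset N).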
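A minimal sketch of this lookup is shown below. The passage does not spell out how a P-value is read off each model, so the formulas here are assumptions: in M1 the P-value is taken as the fraction of sampled scores at least as large as s, and in M2 the count of scores at least as large as s is divided by the size of O, since M holds every score in O above the cutoff. The names `p_value`, `normal_tail`, `total`, and `tail` are hypothetical.

```python
from bisect import bisect_left
from math import erfc, sqrt
from statistics import NormalDist

def p_value(s, m1, m2, s_c, s_m, total, tail=None):
    """Right-tail P-value of s; m1 and m2 are sorted, total is the size of O."""
    if s <= s_c:
        # Below the cutoff: empirical fraction of the sample N that is >= s.
        return (len(m1) - bisect_left(m1, s)) / len(m1)
    if s > s_m:
        # Beyond every observed score: theoretical tail if given, else 0.
        return tail(s) if tail is not None else 0.0
    # Above the cutoff: M2 contains all scores of O in this range,
    # so the count is relative to the full dataset.
    return (len(m2) - bisect_left(m2, s)) / total

def normal_tail(m1):
    """Normal right-tail probability with mean and stdev estimated from N."""
    fit = NormalDist.from_samples(m1)
    return lambda s: 0.5 * erfc((s - fit.mean) / (fit.stdev * sqrt(2)))

# Toy models: O is 1..100, the sample N is every tenth score, cutoff 80.
m1 = [float(x) for x in range(10, 101, 10)]
m2 = [float(x) for x in range(81, 101)]
s_c, s_m, total = 80.0, 100.0, 100

print(p_value(50.0, m1, m2, s_c, s_m, total))   # 0.6
print(p_value(95.0, m1, m2, s_c, s_m, total))   # 0.06
print(p_value(101.0, m1, m2, s_c, s_m, total))  # 0.0
print(p_value(101.0, m1, m2, s_c, s_m, total, tail=normal_tail(m1)))
```

Binary search on the sorted models keeps each lookup logarithmic in the model size, which is what makes saving N and M in sorted form worthwhile.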

For simplicity, here we illustrate our method in a two-stage approach and use the right tail of the distribution to calculate the statistics. In the first stage, we randomly sample a subset N from the original large dataset O. N is usually less than one-hundredth of the size of O, thus saving processing time. We sort N and obtain a cutoff score S_c representing the top P portion of N. Both N and P are parameters specified by the users, and are set to N = 100 000 and P = 0.001 by the default. We then scan the original set and put scores greater than S_c into our second subset M, and we obtain the maximum score S_m in M. The two subsets N and M are sorted, saved, and serve as our two models (M1 and M2). To calculate the P-value for a new score S, we compare S with S_c. If S ≤ S_c, we will find its P-value in M1. Otherwise we use M2. If S > S_m, indicating S is out of our resampling score range, we use theoretical distribution to calculate its P-value or simply set the P-value to 0, at the user’s preference. The parameters of two theoretical distributions, normal and extreme value distributions, were obtained from dataset N.
