Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia.
PLoS One. 2013;8(1):e53112. doi: 10.1371/journal.pone.0053112. Epub 2013 Jan 7.
A common issue in bioinformatics is that computational methods often generate a large number of predictions sorted according to certain confidence scores. A key problem is then determining how many predictions must be selected to include most of the true predictions while maintaining reasonably high precision. In nuclear magnetic resonance (NMR)-based protein structure determination, for instance, computational peak picking methods are becoming increasingly common, although expert knowledge remains the method of choice for determining how many of the thousands of candidate peaks should be considered in order to capture the true peaks. Here, we propose a Benjamini-Hochberg (B-H)-based approach that automatically selects the number of peaks. We formulate the peak selection problem as a multiple testing problem. Given a candidate peak list sorted by either volumes or intensities, we first convert the peaks into p-values and then apply the B-H-based algorithm to automatically select the number of peaks. The proposed approach is tested on state-of-the-art peak picking methods, including WaVPeak [1] and PICKY [2]. Compared with the traditional fixed number-based approach, our approach returns significantly more true peaks. For instance, by combining WaVPeak or PICKY with the proposed method, the missing peak rates are on average reduced by 20% and 26%, respectively, in a benchmark set of 32 spectra extracted from eight proteins. The consensus of the B-H-selected peaks from both WaVPeak and PICKY achieves 88% recall and 83% precision, which significantly outperforms each individual method and the consensus method without the B-H algorithm. The proposed method can be used as a standard procedure for any peak picking method and can be applied straightforwardly to some other prediction selection problems in bioinformatics.
The source code, documentation and example data of the proposed method are available at http://sfb.kaust.edu.sa/pages/software.aspx.
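To illustrate the selection step described in the abstract, the following is a minimal sketch of the standard Benjamini-Hochberg step-up rule, written independently of the authors' released code. It assumes the candidate peaks have already been converted to p-values; the abstract does not specify that conversion, so it is not shown here. The function name `bh_select` and the parameter `alpha` (the target false discovery rate) are hypothetical, and the paper's "B-H-based" algorithm may differ in detail from the textbook rule.

```python
from typing import Sequence


def bh_select(p_values: Sequence[float], alpha: float = 0.05) -> int:
    """Return how many candidates the standard B-H step-up rule keeps.

    Assumption: each candidate peak already has a p-value. The rule sorts
    the p-values in ascending order and finds the largest rank i such that
    p_(i) <= alpha * i / m, where m is the number of candidates. The i
    candidates with the smallest p-values are then selected.
    """
    m = len(p_values)
    sorted_p = sorted(p_values)
    selected = 0
    for rank, p in enumerate(sorted_p, start=1):
        # Step-up: keep the largest rank that passes, even if an
        # earlier rank failed its own threshold.
        if p <= alpha * rank / m:
            selected = rank
    return selected


if __name__ == "__main__":
    # Made-up p-values, for illustration only (not data from the paper).
    example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(bh_select(example, alpha=0.05))
```

The returned count plays the role of the number of peaks to keep from the sorted candidate list, replacing a fixed, manually chosen cutoff.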