Department of Cell Biology and Genetics, School of Basic Medical Sciences, Shenzhen University Health Science Center, Shenzhen 518060, China.
Epigenomics and Computational Biology Lab, Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, VA 24060, USA.
Bioinformatics. 2017 Sep 1;33(17):2631-2641. doi: 10.1093/bioinformatics/btx294.
In genome-wide rate comparison studies, it is challenging to identify an appropriate number of significant features objectively: traditional statistical comparisons without multiple-testing correction can generate a large number of false positives, whereas multiple-testing correction greatly reduces statistical power.
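The trade-off described above can be illustrated with a small sketch. The p-values below are made up for illustration and do not come from the study; the two corrections shown (Bonferroni and Benjamini-Hochberg) are standard procedures chosen here as examples, not necessarily the ones the authors compared against.

```python
# Illustrative only: hypothetical p-values, not data from the study.

def bonferroni(pvalues):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (controls the false discovery rate)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.001, 0.008, 0.012, 0.03, 0.04, 0.2, 0.5, 0.9]  # hypothetical
alpha = 0.05

n_raw = sum(p < alpha for p in raw)
n_bonf = sum(p < alpha for p in bonferroni(raw))
n_bh = sum(p < alpha for p in benjamini_hochberg(raw))

print(f"uncorrected: {n_raw}, Bonferroni: {n_bonf}, Benjamini-Hochberg: {n_bh}")
```

With only eight tests the number of features called significant already shrinks after correction; with tens of thousands of genes the effect is far stronger, which is the loss of power the abstract refers to.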
In this study, we proposed a new exact test based on translating the rate comparison into two binomial distributions. On both modeled and real datasets, the exact binomial test (EBT) showed an advantage in balancing statistical precision and power, yielding an appropriately sized set of significant features for further study. Both correlation analysis and bootstrapping tests demonstrated that EBT is as robust as typical rate-comparison methods, e.g. the χ² test, Fisher's exact test and the binomial test. A performance comparison among machine learning models trained on features identified by the different statistical tests further demonstrated the advantage of EBT. The new test was also applied to analyze genome-wide differences in somatic gene mutation rate between lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC), the two main lung cancer subtypes, and a list of new markers was identified that could be lineage-specifically associated with carcinogenesis of LUAD and LUSC, respectively. Interestingly, three cilia genes were found to have selectively high mutation rates in LUSC, possibly implying the importance of cilia dysfunction in carcinogenesis.
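For context, a per-gene rate comparison of the kind EBT is benchmarked against can be sketched with Fisher's exact test. This is the classical test, not EBT itself: the abstract does not spell out the EBT procedure, so no attempt is made to reproduce it here. The function name and the sample counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are the two groups; columns are mutated / not mutated samples.
    The p-value sums the probabilities of all tables (with the same
    margins) that are no more likely than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x mutated samples in group 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)


# Hypothetical counts: a gene mutated in 30 of 200 samples in one subtype
# and in 60 of 200 samples in the other.
p_value = fisher_exact_two_sided(30, 170, 60, 140)
print(f"Fisher's exact test p-value: {p_value:.3g}")
```

In a genome-wide setting this test would be run once per gene, and the resulting p-values are what a multiple-testing correction would then adjust.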
An R package implementing EBT can be downloaded freely from http://www.szu-bioinf.org/EBT.
Supplementary data are available at Bioinformatics online.