School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, 97330, OR, USA.
Department of Biomedical Sciences, Oregon State University, 106 Dryden Hall, Corvallis, 97330, OR, USA.
BMC Bioinformatics. 2019 Feb 6;20(1):63. doi: 10.1186/s12859-019-2637-4.
We previously reported on CERENKOV, an approach for identifying regulatory single nucleotide polymorphisms (rSNPs) that is based on 246 annotation features. CERENKOV uses the xgboost classifier and is designed to be used to find causal noncoding SNPs in loci identified by genome-wide association studies (GWAS). We reported that CERENKOV has state-of-the-art performance (by two traditional measures and a novel GWAS-oriented measure, AVGRANK) in a comparison to nine other tools for identifying functional noncoding SNPs, using a comprehensive reference SNP set (OSU17, 15,331 SNPs). Given that SNPs are grouped within loci in the reference SNP set and given the importance of the data-space manifold geometry for machine-learning model selection, we hypothesized that within-locus inter-SNP distances would have class-based distributional biases that could be exploited to improve rSNP recognition accuracy. We thus defined an intralocus SNP "radius" as the average data-space distance from a SNP to the other intralocus neighbors, and explored radius likelihoods for five distance measures.
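The radius definition above can be illustrated with a minimal sketch. It assumes a numeric feature matrix `X` with one row per SNP and a parallel array `locus_ids` giving each SNP's locus; the function name `intralocus_radius` and both argument names are hypothetical and not taken from the CERENKOV2 code. The five distance measures are not named in this text, so Euclidean and Manhattan distance are used here purely as stand-ins.

```python
import numpy as np

def intralocus_radius(X, locus_ids, metric="euclidean"):
    """Mean data-space distance from each SNP to the other SNPs in its locus.

    Illustrative sketch only; not the CERENKOV2 implementation.
    """
    X = np.asarray(X, dtype=float)
    locus_ids = np.asarray(locus_ids)
    radius = np.full(len(X), np.nan)
    for locus in np.unique(locus_ids):
        idx = np.flatnonzero(locus_ids == locus)
        if len(idx) < 2:
            continue  # radius is undefined for a locus with a single SNP
        # Pairwise differences between all SNPs in this locus.
        diff = X[idx, None, :] - X[None, idx, :]
        if metric == "euclidean":
            d = np.sqrt((diff ** 2).sum(axis=-1))
        elif metric == "manhattan":
            d = np.abs(diff).sum(axis=-1)
        else:
            raise ValueError(f"unsupported metric: {metric}")
        # Self-distance is zero, so divide by the number of *other* SNPs.
        radius[idx] = d.sum(axis=1) / (len(idx) - 1)
    return radius
```

A SNP that is the only member of its locus has no intralocus neighbors, so the sketch returns NaN for it rather than inventing a value.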
We expanded the set of reference SNPs to 39,083 (the OSU18 set) and extracted CERENKOV SNP feature data. We computed radius empirical likelihoods and likelihood densities for rSNPs and control SNPs, and found significant likelihood differences between the two classes. We fit parametric models of the likelihood distributions for five different distance measures to obtain ten log-likelihood features, which we combined with the 248-dimensional CERENKOV feature matrix. On the OSU18 SNP set, we measured the classification accuracy of CERENKOV with and without the new distance-based features, and found that adding them significantly improves rSNP recognition performance as measured by AUPVR, AUROC, and AVGRANK. The feature data for the OSU18 set, together with the software code for extracting the base feature matrix, estimating the ten distance-based likelihood ratio features, and scoring candidate causal SNPs, are released as the open-source software CERENKOV2.
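As a rough sketch of how class-conditional log-likelihood features might be derived from radii, the code below fits one parametric density to the rSNP radii and one to the control radii, then scores each query SNP under both. The parametric family is not stated in this text, so a Gaussian on the log-radius is an assumption made only for illustration; the names `gaussian_logpdf` and `radius_loglik_features` are hypothetical. Applying it once per distance measure would give two columns per measure, which is one way to arrive at ten features from five measures.

```python
import numpy as np

def gaussian_logpdf(x, mu, sigma):
    """Log density of a normal distribution N(mu, sigma^2) at x."""
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def radius_loglik_features(radius_train, is_rsnp_train, radius_query):
    """Two log-likelihood columns per SNP: under the rSNP and control models.

    Assumption: log-radius is modeled as Gaussian within each class.
    Radii must be strictly positive for the logarithm to be finite.
    """
    radius_train = np.asarray(radius_train, dtype=float)
    is_rsnp_train = np.asarray(is_rsnp_train, dtype=bool)
    log_query = np.log(np.asarray(radius_query, dtype=float))
    columns = []
    for cls in (True, False):  # rSNP model first, then control model
        log_r = np.log(radius_train[is_rsnp_train == cls])
        mu, sigma = log_r.mean(), log_r.std(ddof=1)
        columns.append(gaussian_logpdf(log_query, mu, sigma))
    return np.column_stack(columns)
```

The resulting columns could be appended to a base feature matrix with `np.hstack`. Because the class models are fit from labels, they would normally be estimated on training folds only, so that the features do not leak labels into held-out evaluation.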
Accounting for the locus-specific geometry of SNPs in data-space significantly improved the accuracy with which noncoding rSNPs can be computationally identified.