HyDRA: gene prioritization via hybrid distance-score rank aggregation.

Kim Minji, Farnoud Farzad, Milenkovic Olgica

Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.

Bioinformatics. 2015 Apr 1;31(7):1034-43. doi: 10.1093/bioinformatics/btu766. Epub 2014 Nov 18.

UNLABELLED

Gene prioritization refers to a family of computational techniques for inferring disease genes through a set of training genes and carefully chosen similarity criteria. Test genes are scored based on their average similarity to the training set, and the rankings of genes under various similarity criteria are aggregated via statistical methods. The contributions of our work are threefold: (i) first, based on the realization that there is no unique way to define an optimal aggregate for rankings, we investigate the predictive quality of a number of new aggregation methods and known fusion techniques from machine learning and social choice theory. Within this context, we quantify the influence of the number of training genes and similarity criteria on the diagnostic quality of the aggregate and perform in-depth cross-validation studies; (ii) second, we propose a new approach to genomic data aggregation, termed HyDRA (Hybrid Distance-score Rank Aggregation), which combines the advantages of score-based and combinatorial aggregation techniques. We also propose incorporating a new top-versus-bottom (TvB) weighting feature into the hybrid schemes. The TvB feature ensures that aggregates are more reliable at the top of the list, rather than at the bottom, since only top candidates are tested experimentally; (iii) third, we propose an iterative procedure for gene discovery that operates via successful augmentation of the set of training genes by genes discovered in previous rounds, checked for consistency.
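The scoring step described above (each test gene scored by its average similarity to the training set, giving one ranking per similarity criterion) can be sketched as follows. This is a minimal illustration, not the HyDRA implementation: the function name `rank_by_average_similarity`, the gene labels, and the similarity values are all hypothetical.

```python
from statistics import mean

def rank_by_average_similarity(similarity, training_genes, test_genes):
    """Rank test genes by mean similarity to the training genes, best first.

    `similarity[g][t]` is assumed to hold the similarity between test gene g
    and training gene t under a single criterion (hypothetical data layout).
    """
    scores = {
        g: mean(similarity[g][t] for t in training_genes)
        for g in test_genes
    }
    # Higher average similarity ranks first; ties are broken by gene name.
    return sorted(test_genes, key=lambda g: (-scores[g], g))

# Made-up similarity values for one criterion.
toy_similarity = {
    "G1": {"T1": 0.9, "T2": 0.7},
    "G2": {"T1": 0.2, "T2": 0.4},
    "G3": {"T1": 0.6, "T2": 0.5},
}
toy_ranking = rank_by_average_similarity(
    toy_similarity, ["T1", "T2"], ["G1", "G2", "G3"]
)
print(toy_ranking)
```

Repeating this for every similarity criterion would yield the set of rankings that an aggregation method then combines.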

MOTIVATION

Fundamental results from social choice theory, political and computer sciences, and statistics have shown that there exists no consistent, fair and unique way to aggregate rankings. Instead, one has to decide on an aggregation approach using a predefined set of desirable properties for the aggregate. The aggregation methods fall into two categories, score-based and distance-based approaches, each of which has its own drawbacks and advantages. This work is motivated by the observation that, by merging these two techniques in a computationally efficient manner and incorporating additional constraints, one can ensure that the predictive quality of the resulting aggregation algorithm is very high.
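To make the two categories concrete, the sketch below contrasts a standard score-based rule (Borda count) with a standard distance-based rule (a Kemeny aggregate under the Kendall tau distance, found here by brute force). These are textbook representatives chosen for illustration; they are not the hybrid scheme proposed in the paper, and the toy rankings are invented.

```python
from itertools import combinations, permutations

def borda(rankings):
    """Score-based: sum each item's positions; a lower total ranks higher."""
    items = rankings[0]
    total = {g: sum(r.index(g) for r in rankings) for g in items}
    return sorted(items, key=lambda g: (total[g], g))

def kendall_tau(a, b):
    """Count the item pairs that rankings a and b order differently."""
    pos_a = {g: i for i, g in enumerate(a)}
    pos_b = {g: i for i, g in enumerate(b)}
    return sum(
        (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
        for x, y in combinations(a, 2)
    )

def kemeny(rankings):
    """Distance-based: the permutation with the smallest total Kendall tau
    distance to the input rankings (exhaustive search, small inputs only)."""
    items = sorted(rankings[0])
    best = min(
        permutations(items),
        key=lambda p: sum(kendall_tau(p, r) for r in rankings),
    )
    return list(best)

# Three criteria rank A > B > C, two rank B > C > A (made-up data).
toy_rankings = [["A", "B", "C"]] * 3 + [["B", "C", "A"]] * 2
borda_result = borda(toy_rankings)
kemeny_result = kemeny(toy_rankings)
print(borda_result, kemeny_result)
```

On this input the two rules disagree about the top item, which is a small instance of the point that no single aggregate is canonical.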

RESULTS

We tested HyDRA on a number of gene sets, including autism, breast cancer, colorectal cancer, endometriosis, ischaemic stroke, leukemia, lymphoma and osteoarthritis. Furthermore, we performed iterative gene discovery for glioblastoma, meningioma and breast cancer, using a sequentially augmented list of training genes related to the Turcot syndrome, Li-Fraumeni condition and other diseases. The methods outperform state-of-the-art software tools such as ToppGene and Endeavour. Despite this finding, we recommend, as best practice, taking the union of the top-ranked items produced by different methods to form the final aggregated list.
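The recommended practice of taking the union of top-ranked items across methods can be sketched in a few lines. The cutoff `k`, the function name `union_of_top_k`, and the example rankings are assumptions made for illustration; the abstract does not specify how many top items to take or how to order the union.

```python
def union_of_top_k(rankings, k):
    """Return genes that appear in the top k of at least one ranking.

    Genes are emitted position by position across the rankings, so items
    ranked first by any method come before items ranked second, and so on.
    """
    selected = []
    for position in range(k):
        for ranking in rankings:
            if position < len(ranking) and ranking[position] not in selected:
                selected.append(ranking[position])
    return selected

# Hypothetical output of two different aggregation methods.
method_a = ["A", "B", "C", "D"]
method_b = ["B", "E", "A", "C"]
candidates = union_of_top_k([method_a, method_b], k=2)
print(candidates)
```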

AVAILABILITY AND IMPLEMENTATION

The HyDRA software may be downloaded from: http://web.engr.illinois.edu/~mkim158/HyDRA.zip.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Similar articles

HyDRA: gene prioritization via hybrid distance-score rank aggregation.

Bioinformatics. 2015 Apr 1;31(7):1034-43. doi: 10.1093/bioinformatics/btu766. Epub 2014 Nov 18.

Inferring disease and gene set associations with rank coherence in networks.

Bioinformatics. 2011 Oct 1;27(19):2692-9. doi: 10.1093/bioinformatics/btr463. Epub 2011 Aug 8.

Weighted rank aggregation of cluster validation measures: a Monte Carlo cross-entropy approach.

Bioinformatics. 2007 Jul 1;23(13):1607-15. doi: 10.1093/bioinformatics/btm158. Epub 2007 May 5.

GPS: Identification of disease genes by rank aggregation of multi-genomic scoring schemes.

Genomics. 2019 Jul;111(4):612-618. doi: 10.1016/j.ygeno.2018.03.017. Epub 2018 Mar 28.

Prioritization of positional candidate genes using multiple web-based software tools.

Twin Res Hum Genet. 2007 Dec;10(6):861-70. doi: 10.1375/twin.10.6.861.

smallWig: parallel compression of RNA-seq WIG files.

Bioinformatics. 2016 Jan 15;32(2):173-80. doi: 10.1093/bioinformatics/btv561. Epub 2015 Sep 30.

MetaKTSP: a meta-analytic top scoring pair method for robust cross-study validation of omics prediction analysis.

Bioinformatics. 2016 Jul 1;32(13):1966-73. doi: 10.1093/bioinformatics/btw115. Epub 2016 Mar 2.

ProDiGe: Prioritization Of Disease Genes with multitask machine learning from positive and unlabeled examples.

BMC Bioinformatics. 2011 Oct 6;12:389. doi: 10.1186/1471-2105-12-389.

ToppGene Suite for gene list enrichment analysis and candidate gene prioritization.

Nucleic Acids Res. 2009 Jul;37(Web Server issue):W305-11. doi: 10.1093/nar/gkp427. Epub 2009 May 22.

Translational Metabolomics of Head Injury: Exploring Dysfunctional Cerebral Metabolism with Ex Vivo NMR Spectroscopy-Based Metabolite Quantification

Cited by

Disease gene prediction with privileged information and heteroscedastic dropout.

Bioinformatics. 2021 Jul 12;37(Suppl_1):i410-i417. doi: 10.1093/bioinformatics/btab310.

EARN: an ensemble machine learning algorithm to predict driver genes in metastatic breast cancer.

BMC Med Genomics. 2021 May 7;14(1):122. doi: 10.1186/s12920-021-00974-3.

pBRIT: gene prioritization by correlating functional and phenotypic annotations through integrative data fusion.

Bioinformatics. 2018 Jul 1;34(13):2254-2262. doi: 10.1093/bioinformatics/bty079.

CRCDA--Comprehensive resources for cancer NGS data analysis.

Database (Oxford). 2015 Oct 8;2015. doi: 10.1093/database/bav092. Print 2015.
