HyDRA: gene prioritization via hybrid distance-score rank aggregation.
Author information
Kim Minji, Farnoud Farzad, Milenkovic Olgica
Affiliation
Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.
Publication information
Bioinformatics. 2015 Apr 1;31(7):1034-43. doi: 10.1093/bioinformatics/btu766. Epub 2014 Nov 18.
UNLABELLED
Gene prioritization refers to a family of computational techniques for inferring disease genes through a set of training genes and carefully chosen similarity criteria. Test genes are scored based on their average similarity to the training set, and the rankings of genes under various similarity criteria are aggregated via statistical methods. The contributions of our work are threefold. (i) Based on the realization that there is no unique way to define an optimal aggregate for rankings, we investigate the predictive quality of a number of new aggregation methods and known fusion techniques from machine learning and social choice theory. Within this context, we quantify the influence of the number of training genes and similarity criteria on the diagnostic quality of the aggregate and perform in-depth cross-validation studies. (ii) We propose a new approach to genomic data aggregation, termed HyDRA (Hybrid Distance-score Rank Aggregation), which combines the advantages of score-based and combinatorial aggregation techniques. We also propose incorporating a new top-versus-bottom (TvB) weighting feature into the hybrid schemes. The TvB feature ensures that aggregates are more reliable at the top of the list than at the bottom, since only top candidates are tested experimentally. (iii) We propose an iterative procedure for gene discovery that operates via successful augmentation of the set of training genes by genes discovered in previous rounds, checked for consistency.
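The scoring step described above (rank test genes by their average similarity to the training set, once per similarity criterion) can be illustrated with a minimal sketch. This is not the HyDRA implementation; the function name `rank_by_average_similarity`, the gene labels and the similarity values are hypothetical and chosen only for illustration.

```python
from statistics import mean

def rank_by_average_similarity(similarity, training_genes, test_genes):
    """Rank test genes by mean similarity to the training set (best first)."""
    scores = {
        g: mean(similarity[g][t] for t in training_genes)
        for g in test_genes
    }
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(test_genes, key=lambda g: (-scores[g], g))

# Hypothetical similarity values for ONE criterion (not taken from the paper).
similarity = {
    "G1": {"T1": 0.9, "T2": 0.7},
    "G2": {"T1": 0.2, "T2": 0.4},
    "G3": {"T1": 0.6, "T2": 0.6},
}
ranking = rank_by_average_similarity(similarity, ["T1", "T2"], ["G1", "G2", "G3"])
print(ranking)
```

Repeating this for each similarity criterion would yield one ranking per criterion, which is the input to the aggregation step.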
MOTIVATION
Fundamental results from social choice theory, political and computer sciences, and statistics have shown that there exists no consistent, fair and unique way to aggregate rankings. Instead, one has to decide on an aggregation approach using a predefined set of desirable properties for the aggregate. The aggregation methods fall into two categories, score-based and distance-based approaches, each of which has its own drawbacks and advantages. This work is motivated by the observation that, by merging these two techniques in a computationally efficient manner and by incorporating additional constraints, one can ensure that the predictive quality of the resulting aggregation algorithm is very high.
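To make the two categories concrete, the sketch below contrasts one textbook example of each: the Borda count (score-based) and a brute-force Kemeny aggregate that minimizes the total Kendall tau distance (distance-based). These are standard methods used here as stand-ins, not the hybrid scheme proposed in the paper; the function names and the toy rankings are assumptions, and the brute-force search is only feasible for very short lists.

```python
from itertools import combinations, permutations

def borda_aggregate(rankings):
    """Score-based: each gene earns (n - position) points per ranking."""
    n = len(rankings[0])
    points = {g: 0 for g in rankings[0]}
    for r in rankings:
        for pos, g in enumerate(r):
            points[g] += n - pos
    return sorted(points, key=lambda g: (-points[g], g))

def kendall_tau(a, b):
    """Number of gene pairs that rankings a and b order differently."""
    pos_a = {g: i for i, g in enumerate(a)}
    pos_b = {g: i for i, g in enumerate(b)}
    return sum(
        1
        for x, y in combinations(a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )

def kemeny_aggregate(rankings):
    """Distance-based: exhaustive search for the ranking with the
    smallest total Kendall tau distance to all input rankings."""
    genes = sorted(rankings[0])
    best = min(
        permutations(genes),
        key=lambda cand: sum(kendall_tau(cand, r) for r in rankings),
    )
    return list(best)

# Toy rankings from three hypothetical similarity criteria.
rankings = [["A", "B", "C"], ["A", "C", "B"], ["B", "A", "C"]]
borda = borda_aggregate(rankings)
kemeny = kemeny_aggregate(rankings)
print(borda, kemeny)
```

On this toy input both methods agree, but in general they can produce different aggregates, which is one reason a hybrid of the two families is of interest.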
RESULTS
We tested HyDRA on a number of gene sets, including autism, breast cancer, colorectal cancer, endometriosis, ischaemic stroke, leukemia, lymphoma and osteoarthritis. Furthermore, we performed iterative gene discovery for glioblastoma, meningioma and breast cancer, using a sequentially augmented list of training genes related to the Turcot syndrome, Li-Fraumeni condition and other diseases. The methods outperform state-of-the-art software tools such as ToppGene and Endeavour. Despite this finding, we recommend, as best practice, taking the union of the top-ranked items produced by different methods to form the final aggregated list.
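The recommended practice of taking the union of top-ranked items across methods could look like the following minimal sketch. The cutoff `k`, the function name `union_of_top_k` and the placeholder gene labels are assumptions for illustration; the abstract does not specify how many top items to take or how to order the union.

```python
def union_of_top_k(rankings_by_method, k):
    """Merge the top-k genes of each method, keeping first-seen order."""
    merged = []
    seen = set()
    for ranking in rankings_by_method.values():
        for gene in ranking[:k]:
            if gene not in seen:
                seen.add(gene)
                merged.append(gene)
    return merged

# Placeholder outputs of three prioritization tools (not real results).
rankings_by_method = {
    "HyDRA": ["G1", "G2", "G3", "G4"],
    "ToppGene": ["G2", "G5", "G1", "G3"],
    "Endeavour": ["G6", "G1", "G2", "G4"],
}
candidates = union_of_top_k(rankings_by_method, k=2)
print(candidates)
```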
AVAILABILITY AND IMPLEMENTATION
The HyDRA software may be downloaded from: http://web.engr.illinois.edu/~mkim158/HyDRA.zip.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.