Roslin Institute, University of Edinburgh, Edinburgh EH25 9RG, UK.
University of Sheffield, Sheffield S10 2NT, UK.
Bioinformatics. 2022 Oct 31;38(21):4927-4933. doi: 10.1093/bioinformatics/btac621.
A common experimental output in biomedical science is a list of genes implicated in a given biological process or disease. Gene lists from a group of studies that address the same, or similar, questions can be combined by ranking aggregation methods to find a consensus or a more reliable answer. Because the properties of a dataset can influence the performance of an algorithm, a ranking aggregation method should be evaluated on the relevant type of data before it is used. For gene lists, such evaluation is usually based on simulated data because real data lack a known truth. However, simulated datasets tend to be too small compared to experimental data and neglect key features, including heterogeneity of quality, relevance and the inclusion of unranked lists.
In this study, a group of existing methods and their variations that are suitable for meta-analysis of gene lists are compared using simulated and real data. Simulated data were used to explore how the aggregation methods perform under scenarios common in real genomic data: varying heterogeneity of quality, varying noise levels and a mix of unranked and ranked lists, drawn from 20 000 possible entities. In addition to the evaluation with simulated data, a comparison using real genomic data on the SARS-CoV-2 virus, cancer (non-small cell lung cancer) and bacteria (macrophage apoptosis) was performed. We summarize the results of our evaluation in a simple flowchart for selecting a ranking aggregation method, and in an automated implementation that uses the meta-analysis by information content (MAIC) algorithm to infer heterogeneity of data quality across input datasets.
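To make the idea of aggregating ranked and unranked gene lists concrete, the following is a minimal sketch of a Borda-style count. It is a generic textbook approach chosen for illustration only; it is not the MAIC algorithm and is not claimed to be one of the methods evaluated in this study. The function name `borda_aggregate`, the gene labels and the treatment of unranked lists as ties are all assumptions made for the example.

```python
from collections import defaultdict


def borda_aggregate(ranked_lists, unranked_lists=(), universe_size=20000):
    """Combine gene lists into one consensus ranking (illustrative only).

    ranked_lists:   lists of gene names, best gene first.
    unranked_lists: collections of gene names with no internal order.
    universe_size:  number of possible entities (20 000 mirrors the text).
    """
    scores = defaultdict(float)

    # A gene at position p (0-based) in a ranked list earns universe_size - p.
    for genes in ranked_lists:
        for position, gene in enumerate(genes):
            scores[gene] += universe_size - position

    # Assumption: members of an unranked list are treated as tied, so each
    # gets the mean score of the positions the list would have occupied.
    for genes in unranked_lists:
        members = set(genes)
        if not members:
            continue
        tied_score = universe_size - (len(members) - 1) / 2
        for gene in members:
            scores[gene] += tied_score

    # Highest total score first; ties are broken alphabetically.
    return sorted(scores, key=lambda gene: (-scores[gene], gene))


# Hypothetical input: two ranked lists and one unranked list.
consensus = borda_aggregate(
    ranked_lists=[["A", "B", "C"], ["B", "A", "D"]],
    unranked_lists=[{"B", "C"}],
)
print(consensus)  # ['B', 'A', 'C', 'D']
```

This sketch weights every input list equally, which is exactly the simplification that breaks down when list quality is heterogeneous, the situation the evaluation above is designed to probe.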
The code for generating simulated data and running the edited versions of the algorithms is available at https://github.com/baillielab/comparison_of_RA_methods. Code that selects an optimal method based on the results of this review, using the MAIC algorithm to infer the characteristics of an input dataset, can be downloaded from https://github.com/baillielab/maic. An online service for running MAIC is available at https://baillielab.net/maic.
Supplementary data are available at Bioinformatics online.