Jaya Frederick R, Brito Barbara P, Darling Aaron E
Australian Institute for Microbiology & Infection, University of Technology Sydney, 15 Broadway, Ultimo, New South Wales 2007, Australia.
Ecology and Evolution, Research School of Biology, Australian National University, 134 Linnaeus Way, Acton, Australian Capital Territory 2600, Australia.
Virus Evol. 2023 Nov 23;9(2):vead066. doi: 10.1093/ve/vead066. eCollection 2023.
Recombination is a key evolutionary driver in shaping novel viral populations and lineages. When unaccounted for, recombination can bias evolutionary estimates or complicate their interpretation. Therefore, identifying signals of recombination in sequencing data is a key prerequisite to further analyses. A repertoire of recombination detection methods (RDMs) has been developed over the past two decades; however, the prevalence of pandemic-scale viral sequencing data poses a computational challenge for existing methods. Here, we assessed eight RDMs, namely PhiPack (Profile), 3SEQ, GENECONV, recombination detection program (RDP) (OpenRDP), MaxChi (OpenRDP), Chimaera (OpenRDP), UCHIME (VSEARCH), and gmos, to determine whether any are suitable for the analysis of bulk sequencing data. To test the performance and scalability of these methods, we analysed simulated viral sequencing data across a range of sequence diversities, recombination frequencies, and sample sizes. Furthermore, we provide a practical example for the analysis and validation of empirical data. We find that RDMs need to be scalable, use an analytical approach and resolution suitable for the intended research application, and be accurate for the properties of a given dataset (e.g. sequence diversity and estimated recombination frequency). Analysis of simulated and empirical data revealed that the assessed methods exhibited considerable trade-offs between these criteria. Overall, we provide general guidelines for the validation of recombination detection results, the benefits and shortcomings of each assessed method, and future considerations for RDMs applied to large-scale viral sequencing data.