LINflow: a computational pipeline that combines an alignment-free with an alignment-based method to accelerate generation of similarity matrices for prokaryotic genomes.

Author information

Tian Long, Mazloom Reza, Heath Lenwood S, Vinatzer Boris A

Affiliations

School of Plant and Environmental Sciences, Virginia Tech, Blacksburg, VA, USA.

Department of Computer Science, Virginia Tech, Blacksburg, VA, USA.

Publication information

PeerJ. 2021 Mar 24;9:e10906. doi: 10.7717/peerj.10906. eCollection 2021.

Abstract

BACKGROUND

Computing genomic similarity between strains is a prerequisite for genome-based prokaryotic classification and identification. Genomic similarity was first computed as Average Nucleotide Identity (ANI) values based on the alignment of genomic fragments. Since this is computationally expensive, faster and computationally cheaper alignment-free methods have been developed to estimate ANI. However, these methods do not reach the level of accuracy of alignment-based methods.

METHODS

Here we introduce LINflow, a computational pipeline that infers pairwise genomic similarity in a set of genomes. LINflow uses the speed of the alignment-free sourmash tool to identify the genome in a dataset that is most similar to a query genome, and the precision of the alignment-based pyani software to compute ANI between the query genome and that most similar genome. This is repeated for each new genome that is added to a dataset. The sequentially computed ANI values are stored as Life Identification Numbers (LINs), which are then used to infer all other pairwise ANI values in the set. We tested LINflow on four sets totaling 484 genomes and compared its run time and the resulting similarity matrices with those of other tools.
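The workflow described above can be sketched roughly as follows. This is a minimal illustration, not LINflow's actual code: the ANI thresholds, the LIN assignment rule, and the function names (`assign_lin`, `add_genome`, `inferred_ani_floor`) are assumptions made for the example, and `fast_similarity` and `precise_ani` are caller-supplied stand-ins for sourmash and pyani rather than calls into either tool.

```python
from typing import Callable, Dict, List, Optional, Tuple

LIN = Tuple[int, ...]
Similarity = Callable[[str, str], float]

# Hypothetical ANI thresholds (percent), one per LIN position, ascending.
# Chosen for illustration only; they are not taken from the paper.
THRESHOLDS: List[float] = [70.0, 80.0, 90.0, 95.0, 99.0]


def assign_lin(ani: float, ref_lin: LIN, existing: List[LIN]) -> LIN:
    """Derive a LIN for a new genome from its closest relative's LIN (assumed rule)."""
    # The new genome keeps one leading position per threshold that the ANI reaches.
    shared = sum(1 for t in THRESHOLDS if ani >= t)
    if shared == len(THRESHOLDS):
        return ref_lin  # indistinguishable at every threshold
    prefix = ref_lin[:shared]
    # Take the next unused value at the first differing position.
    used = [lin[shared] for lin in existing if lin[:shared] == prefix]
    return prefix + (max(used) + 1,) + (0,) * (len(THRESHOLDS) - shared - 1)


def add_genome(name: str, lins: Dict[str, LIN],
               fast_similarity: Similarity, precise_ani: Similarity) -> None:
    """Add one genome: cheap search for the best match, then one precise ANI."""
    if not lins:
        lins[name] = (0,) * len(THRESHOLDS)  # first genome seeds the scheme
        return
    # Alignment-free step: rank all known genomes with the fast estimate.
    best = max(lins, key=lambda other: fast_similarity(name, other))
    # Alignment-based step: a single precise comparison against the best match.
    ani = precise_ani(name, best)
    lins[name] = assign_lin(ani, lins[best], list(lins.values()))


def inferred_ani_floor(lin_a: LIN, lin_b: LIN) -> Optional[float]:
    """Infer a lower bound on ANI from the shared LIN prefix (None if no shared position)."""
    shared = 0
    for a, b in zip(lin_a, lin_b):
        if a != b:
            break
        shared += 1
    return THRESHOLDS[shared - 1] if shared else None


# Toy data: made-up ANI values standing in for both tools.
TOY_ANI = {
    frozenset(("A", "B")): 96.0,
    frozenset(("A", "C")): 85.0,
    frozenset(("B", "C")): 84.0,
}


def toy_lookup(x: str, y: str) -> float:
    return TOY_ANI[frozenset((x, y))]


lins: Dict[str, LIN] = {}
for genome in ("A", "B", "C"):
    add_genome(genome, lins, toy_lookup, toy_lookup)

for name, lin in lins.items():
    print(name, lin)
print("B vs C inferred ANI floor:", inferred_ani_floor(lins["B"], lins["C"]))
```

In this sketch each added genome costs one precise comparison, so B and C are never compared directly; their similarity is read off their LINs. That is also why an inferred value can differ from a directly computed one.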

RESULTS

LINflow is up to 150 times faster than pyani, and pairwise ANI values generated by LINflow are highly correlated with those computed by pyani. However, because LINflow infers most pairwise ANI values instead of computing them directly, its ANI values occasionally depart from those computed by pyani. In conclusion, LINflow is a fast and memory-efficient pipeline to infer similarity among a large set of prokaryotic genomes. Its ability to quickly add new genome sequences to an already computed similarity matrix makes LINflow particularly useful for projects in which new genome sequences need to be regularly added to an existing dataset.
