A novel alignment-free method for detection of lateral genetic transfer based on TF-IDF.

Author information

Cong Yingnan, Chan Yao-Ban, Ragan Mark A

Affiliations

Institute for Molecular Bioscience and ARC Centre of Excellence in Bioinformatics, The University of Queensland, St Lucia, Brisbane, QLD 4072, Australia.

School of Mathematics and Statistics, The University of Melbourne, Parkville, Melbourne, VIC 3010, Australia.

Publication information

Sci Rep. 2016 Jul 25;6:30308. doi: 10.1038/srep30308.

Abstract

Lateral genetic transfer (LGT) plays an important role in the evolution of microbes. Existing computational methods for detecting genomic regions of putative lateral origin scale poorly to large data. Here, we propose a novel method based on TF-IDF (Term Frequency-Inverse Document Frequency) statistics to detect not only regions of lateral origin, but also their origin and direction of transfer, in sets of hierarchically structured nucleotide or protein sequences. This approach is based on the frequency distributions of k-mers in the sequences. If a set of contiguous k-mers appears sufficiently more frequently in another phyletic group than in its own, we infer that they have been transferred from the first group to the second. We performed rigorous tests of TF-IDF using simulated and empirical datasets. With the simulated data, we tested our method under different parameter settings for sequence length, substitution rate between and within groups and post-LGT, deletion rate, length of transferred region and k size, and found that we can detect LGT events with high precision and recall. Our method performs better than an established method, ALFY, which has high recall but low precision. Our method is efficient, with runtime increasing approximately linearly with sequence length.
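The core idea in the abstract (runs of contiguous k-mers that are markedly more frequent in another group than in the sequence's own group) can be illustrated with a minimal Python sketch. This is not the authors' implementation and does not reproduce their TF-IDF weighting: it substitutes a simple per-group document-frequency comparison. The function names, the `margin` threshold, and the `min_run` parameter are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the overlapping k-mers of seq, in order of start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def group_kmer_fraction(groups, k):
    """For each group, map each k-mer to the fraction of that group's
    sequences in which it occurs (a document-frequency-style statistic)."""
    frac = {}
    for name, seqs in groups.items():
        counts = defaultdict(int)
        for s in seqs:
            for km in set(kmers(s, k)):  # count each sequence at most once
                counts[km] += 1
        frac[name] = {km: c / len(seqs) for km, c in counts.items()}
    return frac


def candidate_regions(query, own_group, groups, k, min_run, margin=0.5):
    """Flag runs of at least min_run contiguous k-mers that are more frequent
    (by at least `margin`, an assumed threshold) in some other group than in
    own_group. Returns (start, end, putative_donor_group) tuples, with end
    exclusive in sequence coordinates."""
    frac = group_kmer_fraction(groups, k)
    labels = []
    for km in kmers(query, k):
        own = frac[own_group].get(km, 0.0)
        best, best_f = None, own + margin
        for g in groups:
            if g == own_group:
                continue
            f = frac[g].get(km, 0.0)
            if f > best_f:
                best, best_f = g, f
        labels.append(best)  # None means "no foreign signal for this k-mer"

    regions = []
    i = 0
    while i < len(labels):
        if labels[i] is None:
            i += 1
            continue
        j = i
        while j < len(labels) and labels[j] == labels[i]:
            j += 1
        if j - i >= min_run:
            # k-mers starting at i..j-1 cover sequence positions i..j-1+k-1
            regions.append((i, j - 1 + k, labels[i]))
        i = j
    return regions
```

Here `groups` is assumed to be a dictionary mapping a group name to a list of sequences, with the query sequence included in its own group. Following the inference rule stated in the abstract, the group in which the k-mers are more frequent is reported as the putative donor. A real analysis would need the weighting, thresholds, and hierarchical group structure described in the paper itself.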

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ea88/4958984/c2803edc4a95/srep30308-f1.jpg